Building distributed AI infrastructure that unifies GPU clusters and cloud providers — enabling researchers to schedule and execute large-scale training workloads across heterogeneous compute environments.
Aengus McGuinness
I build C++ systems where latency matters. Currently at Ironsite in San Francisco — distributed GPU clusters, RDMA fabric, cross-cloud scheduling for AI research teams running large training workloads. Previously at Los Alamos National Laboratory and the Harvard MCB department, where I spent four years on computational biology and high-performance computing.
I’m most interested in problems where p99 is the constraint — low-latency networking, lock-free data structures, query engines, inference serving, and the layers below.
Now
Before
Designed distributed deep learning pipelines on GPU clusters under SLURM. Integrated Bayesian optimization into existing OpenFold workflows, achieving 2× validation improvement while keeping execution scalable across compute nodes.
Optimized distributed HPC pipelines processing terabyte-scale biological datasets on SLURM clusters. Reduced workflow runtime by 80% through parallelization, memory tuning, and GPU-accelerated data processing modules.
Selected Projects
libibverbsA key-value cache with three communication paths — TCP/RPC, two-sided RDMA, and one-sided RDMA reads over a registered hash-table memory region. The one-sided path bypasses server CPU entirely.
On CloudLab Mellanox hardware: 974k ops/s with stable 12 μs p99 latency on one-sided reads — a 13× throughput improvement and 20× tail-latency reduction over the TCP baseline. Adding RDMA FETCH_AND_ADD atomics for recency tracking imposes a consistent 3–4 μs p99 penalty, isolating the cost of cache-policy maintenance on the read path.
A two-phase study of hardware prefetching via stream buffers: Jouppi’s original fixed-depth design and Palacharla & Kessler’s adaptive extension, both implemented as Intel Pin tools with a two-level cache hierarchy and explicit latency model.
The adaptive policy learns stream-length distributions online via a histogram and chooses prefetch depth dynamically. Across SPEC CPU2006 workloads it achieves 5.02× speedup on libquantum and 83% prefetch accuracy on dealII (versus 78% for static next-line), while reducing wasted memory bandwidth on irregular workloads by an order of magnitude. Total hardware cost: 250–400 bytes.
An asynchronous RPC client built around gRPC completion queues and a multi-threaded polling architecture. Increased throughput from ~8k to 55k+ RPC/s via batching, non-blocking I/O, and flow-control tuning.
Implemented blocking system calls (sys_waitpid, sys_msleep) in the Chickadee teaching kernel by redesigning scheduler interactions and eliminating busy-wait loops.
Built a timer-driven sleep mechanism using a hashed timing structure, maintaining correctness under multi-core execution and preventing lost wakeups and race conditions.
Writing
Coming soon.