Aengus McGuinness

Software engineer working on systems and infrastructure. Currently building path-critical infrastructure at Axilon. Previously at Los Alamos National Laboratory and the Harvard MCB department, where I spent four years on computational biology and high-performance computing.

I’m most interested in the places where careful engineering meets hard real-world problems — distributed systems, low-latency networking, ML infrastructure, and the layers below.

Now

Axilon
Software Engineer, New York

Joining as an early engineer on path-critical infrastructure. Axilon brings LLMs into industrial operations and engineering workflows; the work sits at the intersection of applied ML, reliability engineering, and customer-specific deployment. Backed by BoxGroup and RTX Ventures.

Before

Harvard Generative AI Research Program
Research Fellow · Cambridge, MA

Designed distributed deep-learning pipelines on GPU clusters under SLURM. Integrated Bayesian optimization into existing OpenFold workflows, improving validation performance while keeping execution scalable across compute nodes.

Los Alamos National Laboratory, B-GEN
Student Researcher · Santa Fe, NM

Optimized distributed HPC pipelines processing terabyte-scale biological datasets on SLURM clusters. Reduced workflow runtime by 80% through parallelization, memory tuning, and GPU-accelerated data processing modules.

Selected Projects

RCache-RDMA
One-sided RDMA distributed cache · C++, libibverbs · report

A key-value cache with three communication paths — TCP/RPC, two-sided RDMA, and one-sided RDMA reads over a registered hash-table memory region. The one-sided path bypasses the server's CPU entirely.

On CloudLab Mellanox hardware: 974k ops/s with stable 12 μs p99 latency on one-sided reads — a 13× throughput improvement and 20× tail-latency reduction over the TCP baseline. Adding RDMA FETCH_AND_ADD atomics for recency tracking imposes a consistent 3–4 μs p99 penalty, isolating the cost of cache-policy maintenance on the read path.
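The core trick behind the one-sided path is that the client can compute a bucket's remote address entirely locally, using region metadata the server advertises once at setup, and then issue an RDMA read against that address. A minimal sketch of that address computation — the bucket layout, field sizes, and names here are illustrative assumptions, not the layout from the report:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical fixed-size bucket layout the server registers with libibverbs.
// Field sizes are illustrative only.
struct Bucket {
    uint64_t key_hash;   // hash of the stored key; 0 = empty slot
    uint32_t value_len;  // bytes of value actually used
    char     value[52];  // inline value storage
};

// Server-advertised memory-region info, exchanged once over TCP at setup.
struct RegionInfo {
    uint64_t base_addr;   // remote virtual address of bucket[0]
    uint32_t rkey;        // remote key authorizing RDMA access
    uint64_t num_buckets; // power of two, so masking replaces modulo
};

// The client resolves a key to a remote bucket address with no server
// involvement, then posts an IBV_WR_RDMA_READ targeting that address.
inline uint64_t bucket_addr(const RegionInfo& r, const std::string& key) {
    uint64_t h = std::hash<std::string>{}(key);
    uint64_t idx = h & (r.num_buckets - 1); // requires num_buckets == 2^k
    return r.base_addr + idx * sizeof(Bucket);
}
```

Because the lookup is pure arithmetic over advertised metadata, the read path never touches the server's CPU — which is what makes the cache-policy atomics in the measurement above an isolatable cost.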

Adaptive Stream Buffer Prefetcher
Hardware prefetcher simulation · C++, Intel Pin · report

A two-phase study of hardware prefetching via stream buffers: Jouppi’s original fixed-depth design and Palacharla & Kessler’s adaptive extension, both implemented as Intel Pin tools with a two-level cache hierarchy and explicit latency model.

The adaptive policy learns stream-length distributions online via a histogram and chooses prefetch depth dynamically. Across SPEC CPU2006 workloads it achieves 5.02× speedup on libquantum and 83% prefetch accuracy on dealII (versus 78% for static next-line), while reducing wasted memory bandwidth on irregular workloads by an order of magnitude. Total hardware cost: 250–400 bytes.
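The adaptive policy's core loop — record how far each stream ran before dying, then size new stream buffers from that history — can be sketched in a few lines. The 16-entry histogram and the 90th-percentile target below are my illustrative assumptions, not the tuned parameters from the report:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Sketch of histogram-based adaptive prefetch depth: observed stream
// lengths feed a small histogram, and new streams get a depth covering
// most of the observed distribution.
class AdaptiveDepth {
    static constexpr size_t kMaxDepth = 16;            // histogram cap (assumed)
    std::array<size_t, kMaxDepth + 1> hist_{};         // hist_[n] = streams with n useful prefetches
    size_t total_ = 0;

public:
    // Called when a stream buffer is deallocated after `used` useful prefetches.
    void record_stream_length(size_t used) {
        hist_[std::min(used, kMaxDepth)]++;
        total_++;
    }

    // Smallest depth covering ~90% of observed stream lengths; 1 until
    // any history exists, so cold-start behaves like next-line prefetch.
    size_t choose_depth() const {
        if (total_ == 0) return 1;
        size_t cum = 0;
        for (size_t d = 0; d <= kMaxDepth; ++d) {
            cum += hist_[d];
            if (cum * 10 >= total_ * 9) return std::max<size_t>(d, 1);
        }
        return kMaxDepth;
    }
};
```

A histogram this small is also where the 250–400 byte hardware budget comes from: a handful of saturating counters per stream buffer, nothing more.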

High-Performance Async RPC Client
Distributed systems project · C++, gRPC

An asynchronous RPC client built around gRPC completion queues and a multi-threaded polling architecture. Increased throughput from ~8k to 55k+ RPC/s via batching, non-blocking I/O, and flow-control tuning.
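Of those three levers, batching is the most self-contained to illustrate: amortize per-call overhead by coalescing requests and handing full batches to a send callback (in the real client, that callback would post onto a gRPC completion queue). This is a simplified sketch; the class and its batch-size parameter are hypothetical, not the project's actual interface:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal request batcher: callers enqueue payloads, and full batches are
// handed to a user-supplied send function.
class RequestBatcher {
public:
    using SendFn = std::function<void(std::vector<std::string>&&)>;

    RequestBatcher(size_t batch_size, SendFn send)
        : batch_size_(batch_size), send_(std::move(send)) {}

    void enqueue(std::string payload) {
        pending_.push_back(std::move(payload));
        if (pending_.size() >= batch_size_) flush();
    }

    // In a real client this would also fire on a timer, so latency stays
    // bounded when load is too low to fill a batch.
    void flush() {
        if (!pending_.empty()) send_(std::move(pending_));
        pending_.clear();
    }

private:
    size_t batch_size_;
    SendFn send_;
    std::vector<std::string> pending_;
};
```

The timer-driven flush is the usual batching trade-off: larger batches raise throughput, while the flush deadline caps the latency any single request can accrue waiting for neighbors.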

Writing

Coming soon.