Aengus McGuinness

Software engineer working on systems and infrastructure. Currently building path-critical infrastructure at Axilon. Previously at Los Alamos National Laboratory and the Harvard MCB department, where I spent four years on computational biology and high-performance computing.

I’m most interested in the places where careful engineering meets hard real-world problems — distributed systems, low-latency networking, ML infrastructure, and the layers below.

github / email / resume

Now

Axilon

2026 —

Software Engineer, New York

Joining as an early engineer on path-critical infrastructure. Axilon brings LLMs into industrial operations and engineering workflows; the work sits at the intersection of applied ML, reliability engineering, and customer-specific deployment. Backed by BoxGroup and RTX Ventures.

Before

Harvard Generative AI Research Program

Summer 2025

Research Fellow · Cambridge, MA

Designed distributed deep learning pipelines on GPU clusters under SLURM. Integrated Bayesian optimization into existing OpenFold workflows, achieving 2× validation improvement while keeping execution scalable across compute nodes.

Los Alamos National Laboratory, B-GEN

2022 – 2024

Student Researcher · Santa Fe, NM

Optimized distributed HPC pipelines processing terabyte-scale biological datasets on SLURM clusters. Reduced workflow runtime by 80% through parallelization, memory tuning, and GPU-accelerated data processing modules.

Selected Projects

RCache-RDMA

2026

One-sided RDMA distributed cache · C++, libibverbs · report

A key-value cache with three communication paths: TCP/RPC, two-sided RDMA, and one-sided RDMA reads over a registered hash-table memory region. The one-sided path bypasses the server CPU entirely.

On CloudLab Mellanox hardware: 974k ops/s with stable 12 μs p99 latency on one-sided reads — a 13× throughput improvement and 20× tail-latency reduction over the TCP baseline. Adding RDMA FETCH_AND_ADD atomics for recency tracking imposes a consistent 3–4 μs p99 penalty, isolating the cost of cache-policy maintenance on the read path.

RDMA · distributed systems · performance

Adaptive Stream Buffer Prefetcher

2026

Hardware prefetcher simulation · C++, Intel Pin · report

A two-phase study of hardware prefetching via stream buffers: Jouppi’s original fixed-depth design and Palacharla & Kessler’s adaptive extension, both implemented as Intel Pin tools with a two-level cache hierarchy and explicit latency model.

The adaptive policy learns stream-length distributions online via a histogram and chooses prefetch depth dynamically. Across SPEC CPU2006 workloads it achieves 5.02× speedup on libquantum and 83% prefetch accuracy on dealII (versus 78% for static next-line), while reducing wasted memory bandwidth on irregular workloads by an order of magnitude. Total hardware cost: 250–400 bytes.
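A minimal sketch of the histogram idea, not the project's actual code: it records the length of each finished stream and picks the smallest depth that would have covered a chosen fraction of them. The class name, the 16-line cap, the coverage threshold, and the quantile rule itself are all illustrative assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical cap on tracked stream length, in cache lines (an assumption).
constexpr std::size_t kMaxDepth = 16;

// Illustrative only: learns a stream-length distribution online and
// derives a prefetch depth from it using a simple quantile rule.
class StreamLengthHistogram {
public:
    // Record the length (in cache lines) of a stream that just ended.
    // Lengths above kMaxDepth are clamped into the last bucket.
    void record(std::size_t length) {
        if (length > kMaxDepth) length = kMaxDepth;
        ++counts_[length];
        ++total_;
    }

    // Return the smallest depth d such that at least `coverage` of the
    // observed streams had length <= d. With no samples yet, return
    // `fallback` (for example, a fixed static depth).
    std::size_t chooseDepth(double coverage, std::size_t fallback) const {
        if (total_ == 0) return fallback;
        const double needed = coverage * static_cast<double>(total_);
        std::uint64_t running = 0;
        for (std::size_t d = 0; d <= kMaxDepth; ++d) {
            running += counts_[d];
            if (static_cast<double>(running) >= needed) return d;
        }
        return kMaxDepth;
    }

private:
    std::array<std::uint64_t, kMaxDepth + 1> counts_{};  // zero-initialized
    std::uint64_t total_ = 0;
};
```

In a simulator, `record` would be called whenever a stream buffer is reallocated, and `chooseDepth` when a new stream is allocated. A lower coverage value trades some missed prefetches for less wasted bandwidth on short, irregular streams.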

computer architecture · simulation · SPEC CPU2006

High-Performance Async RPC Client

2026

Distributed systems project · C++, gRPC

An asynchronous RPC client built around gRPC completion queues and a multi-threaded polling architecture. Increased throughput from ~8k to 55k+ RPC/s via batching, non-blocking I/O, and flow-control tuning.

networking · concurrency · gRPC

Writing

Coming soon.