The archive · 6 pieces
Writing
Long essays, working notes, and series on agent reliability, evaluation, and the EU AI Act. Written from inside production.
Deterministic orchestration around non-deterministic models
How to build agent workflows you can replay, diff, and certify, even though the underlying LLM call offers none of those guarantees.
Agent security is not model security
Why prompt-injection benchmarks tell you almost nothing about whether your agent is safe to deploy — and what to test instead.
The Article 15 testing gap
What "appropriate level of accuracy, robustness and cybersecurity" means in practice for LLM-based agents — and why most teams have no answer.
End-to-end evals for agentic systems
Unit tests and benchmarks miss the failures that actually break agents. A pattern for evaluating the system as a whole.
ML engineering is not ML research
A note for engineers transitioning into ML — and for researchers wondering why their prototype keeps falling over in production.
How I read papers as a working engineer
A practical workflow for staying current with ML research without making it a second job.