Topic · 3 pieces
Testing
End-to-end, contract, and replay testing for ML systems, with an emphasis on what survives contact with production.
Agent security is not model security
Why prompt-injection benchmarks tell you almost nothing about whether your agent is safe to deploy — and what to test instead.
The Article 15 testing gap
What "appropriate level of accuracy, robustness and cybersecurity" means in practice for LLM-based agents — and why most teams have no answer.
End-to-end evals for agentic systems
Unit tests and benchmarks miss the failures that actually break agents. A pattern for evaluating the system as a whole.