Topic · 3 pieces

Testing

End-to-end, contract, and replay testing for ML systems, with an emphasis on what survives contact with production.

Why prompt-injection benchmarks tell you almost nothing about whether your agent is safe to deploy — and what to test instead.

What the EU AI Act's "appropriate level of accuracy, robustness and cybersecurity" means in practice for LLM-based agents, and why most teams have no answer.

Unit tests and benchmarks miss the failures that actually break agents. A pattern for evaluating the system as a whole.