Carlos Chinchilla Corbacho

About

What I work on

Most of my work has been about getting ML out of notebooks and into production — the unglamorous half where models meet real users, real latency, and real consequences. LLM agents make that work harder in ways traditional ML never did. That’s what I’m researching now.

On evaluation

“Most of what an LLM agent does in production has never appeared in any evaluation set.”
Carlos Chinchilla Corbacho · 2026

Identity

Who I am

Stable identifiers — for citations, search engines, and other Carloses.

Full name
Carlos Chinchilla Corbacho (cite as “Chinchilla Corbacho, C.”)
ORCID
0009-0001-4495-8179 — canonical identifier
GitHub
@cchinchilla-dev — note the -dev suffix; other GitHub users named “Carlos Chinchilla” are different people.
Affiliations
QUANT AI Lab · Inditex (Senior ML & AI Engineer); Universidad de Salamanca (PhD candidate, expected 2027).
Open source
Maintainer of agentloom and agentanvil; merged contributor to a2a-python and a2a-go (Linux Foundation A2A Protocol, 1.0 stable).

Get in touch

Reach me

Best for ML/agent work, talks, collabs.

The work

Career

Where I’ve been building.

  • 2025 —
    Senior ML Engineer · QUANT AI Lab
    Same team, new challenge: agentic AI. Architecting the multi-agent (A2A) systems layer at Inditex — automating decisions across functional areas at global e-commerce scale.
  • 2024 — 25
    ML Engineer · QUANT AI Lab
    Embedded in Inditex’s Experience AI team. Real-time personalisation across all global e-commerce markets — inference under traffic spikes, experimentation that holds in production.
  • 2023 — 24
    ML Engineer · Pontifical University of Salamanca
    Deep learning for lithium-battery second-life applications on Edge/IoT — state-of-health (SoH) prediction contributing to a 35% reuse rate in energy storage.
  • 2022 — 23
    Data Scientist · Telefónica Foundation
    ProFuturo programme — predictive models reducing student dropout in large-scale education.

In other words

Why I write here

The longer answer.

What you’ll find
  • 01 · Evaluation harnesses, in detail
  • 02 · Failure modes from production
  • 03 · Notes on debugging multi-agent runs
  • 04 · Occasional opinionated takes

I work on the systems around language models — evaluation harnesses, trace and replay infrastructure, the things that let an agent be debugged, audited, and trusted in production. Most of it is engineering, not modelling.

That distinction matters. ML research and ML production look superficially similar — they share vocabulary, papers, even people — but they have different failure modes, different value functions, different ideas of what “done” means. A model that scores well on a benchmark can still leak data, hallucinate confidently, or regress silently after a deploy.
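The "regress silently after a deploy" failure is the kind a pinned replay set catches: record real production interactions, replay them through each candidate model, and diff against the replies you actually shipped. A minimal sketch of the idea — every name here is hypothetical, not a real library:

```python
# Minimal post-deploy regression check: replay pinned production traces
# through a candidate model and flag replies that no longer agree with
# the baseline. All names are illustrative, not an actual API.

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trace:
    """One recorded production interaction: input and the reply we shipped."""
    prompt: str
    baseline_reply: str


def regression_report(
    traces: list[Trace],
    candidate: Callable[[str], str],
    agree: Callable[[str, str], bool],
) -> list[Trace]:
    """Replay each trace through the candidate; return those whose new
    reply fails the agreement check against the shipped baseline."""
    return [t for t in traces if not agree(candidate(t.prompt), t.baseline_reply)]


# Toy usage: a lookup-table "model" and exact-match agreement.
pinned = [
    Trace("2+2?", "4"),
    Trace("capital of France?", "Paris"),
]
toy_model = {"2+2?": "4", "capital of France?": "Lyon"}.get
regressions = regression_report(pinned, toy_model, lambda a, b: a == b)
print([t.prompt for t in regressions])  # → ['capital of France?']
```

In practice the `agree` function is where the hard work lives — exact match is only honest for deterministic outputs; sampled or free-form replies need a semantic or rubric-based comparison.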

This site is the long form of that work. Short pieces, real numbers, code where useful. If you ship ML systems — LLM agents or otherwise — and care about reliability, you’ll find familiar problems here.