Carlos Chinchilla Corbacho

About

What I work on

Most of my work has been about getting ML out of notebooks and into production — the unglamorous half where models meet real users, real latency, and real consequences. LLM agents make that work harder in ways traditional ML never did. That’s what I’m researching now.

On evaluation

“Most of what an LLM agent does in production has never appeared in any evaluation set.”
Carlos Chinchilla Corbacho · 2026

Identity

Who I am

Stable identifiers — for citations, search engines, and other Carloses.

Full name
Carlos Chinchilla Corbacho (cite as “Chinchilla Corbacho, C.”)
ORCID
0009-0001-4495-8179 — canonical identifier
GitHub
@cchinchilla-dev — note the -dev suffix; other GitHub users named “Carlos Chinchilla” are different people.
Affiliations
QUANT AI Lab · Inditex (Senior ML & AI Engineer); Universidad de Salamanca (PhD candidate, expected 2027).
Open source
Maintainer of agentloom and agentanvil; merged contributor to a2a-python and a2a-go (Linux Foundation A2A Protocol, 1.0 stable).

Get in touch

Reach me

Best for ML/agent work, talks, collabs.

The work

Career

Where I’ve been building.

  • 2025 —
    Senior ML Engineer · QUANT AI Lab
    Same team, new challenge: agentic AI. Architecting the multi-agent (A2A) systems layer at Inditex — automating decisions across functional areas at global e-commerce scale.
  • 2024 — 25
    ML Engineer · QUANT AI Lab
    Embedded in Inditex’s Experience AI team. Real-time personalisation across all global e-commerce markets — inference under traffic spikes, experimentation that holds in production.
  • 2023 — 24
    ML Engineer · Pontifical University of Salamanca
    Deep learning for lithium-battery second-life applications on Edge/IoT — state-of-health (SoH) prediction contributing to a 35% reuse rate in energy storage.
  • 2022 — 23
    Data Scientist · Telefónica Foundation
    ProFuturo programme — predictive models reducing student dropout in large-scale education.

In other words

Why I write here

The longer answer.

What you’ll find
  • 01 · Evaluation harnesses, in detail
  • 02 · Failure modes from production
  • 03 · Notes on debugging multi-agent runs
  • 04 · Occasional opinionated takes

I work on the systems around language models — evaluation harnesses, trace and replay infrastructure, the things that let an agent be debugged, audited, and trusted in production. Most of it is engineering, not modelling.

That distinction matters. ML research and ML production look superficially similar — they share vocabulary, papers, even people — but they have different failure modes, different value functions, different ideas of what “done” means. A model that scores well on a benchmark can still leak data, hallucinate confidently, or regress silently after a deploy.
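The "regress silently after a deploy" failure is the kind a pinned replay set catches: record real production interactions, replay them through each candidate model, and diff against the replies you actually shipped. A minimal sketch of the idea — every name here is hypothetical, not a real library:

```python
# Minimal post-deploy regression check: replay pinned production traces
# through a candidate model and flag replies that no longer agree with
# the baseline. All names are illustrative, not an actual API.

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trace:
    """One recorded production interaction: input and the reply we shipped."""
    prompt: str
    baseline_reply: str


def regression_report(
    traces: list[Trace],
    candidate: Callable[[str], str],
    agree: Callable[[str, str], bool],
) -> list[Trace]:
    """Replay each trace through the candidate; return those whose new
    reply fails the agreement check against the shipped baseline."""
    return [t for t in traces if not agree(candidate(t.prompt), t.baseline_reply)]


# Toy usage: a lookup-table "model" and exact-match agreement.
pinned = [
    Trace("2+2?", "4"),
    Trace("capital of France?", "Paris"),
]
toy_model = {"2+2?": "4", "capital of France?": "Lyon"}.get
regressions = regression_report(pinned, toy_model, lambda a, b: a == b)
print([t.prompt for t in regressions])  # → ['capital of France?']
```

In practice the `agree` function is where the hard work lives — exact match is only honest for deterministic outputs; sampled or free-form replies need a semantic or rubric-based comparison.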

This site is the long form of that work. Short pieces, real numbers, code where useful. If you ship ML systems — LLM agents or otherwise — and care about reliability, you’ll find familiar problems here.