The Infrastructure Nightmare Nobody Is Talking About

Coding agents are speeding up product work, but the pressure is shifting to data and infrastructure platforms where review, reliability, and operations must…

Coding agents are not only producing more features; they are moving the bottleneck down into the systems that have to run them. In this conversation, Emma, who leads Data Platform Infrastructure Engineering at OpenAI, describes a practical tension: product teams are accelerating with Codex, while platform teams must absorb more workloads, pull requests, and support requests without lowering reliability standards.

Where the bottleneck moves

OpenAI’s data platform supports analytics, streaming, event buses, ML infrastructure, feature stores, secure data movement, training data, and eval data. In practice, nearly every team depends on it. As agents make application teams faster, the platform becomes the pressure point.

Emma points to workflows already in use: release automation, Slack triage, PRs with evidence and videos, internal support, job debugging, and infrastructure knowledge packaged as skills. These save time, but they also expose an asymmetry: an alpha application can tolerate imperfection; a Spark or Kafka cluster cannot.

Defense in depth for agentic engineering

The hard part is not just generating code. The system also has to review it, deploy it, isolate it, and operate it safely. Emma argues for a multi-agent architecture: one agent writes code, specialized agents review it, and operational agents monitor and intervene. That looks like an upgraded form of code ownership, backed by knowledge bases, tool calls, skills, and internal evals.
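As a rough illustration of that division of labor, here is a minimal sketch in Python. It is not the architecture described in the conversation. The names `Change`, `spark_reviewer`, `size_reviewer`, `run_reviews`, and `should_roll_back`, along with the 400-line and 2x thresholds, are all hypothetical and chosen only to show the shape of the idea: one agent proposes a change, specialized reviewers each contribute findings, and an operational check decides whether to intervene after deployment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Change:
    """A proposed change produced by the code-writing agent (hypothetical type)."""
    author_agent: str
    diff: str
    findings: list[str] = field(default_factory=list)


# A reviewer takes a change and returns its objections (empty list = no objection).
Reviewer = Callable[[Change], list[str]]


def spark_reviewer(change: Change) -> list[str]:
    # Hypothetical rule: flag code that pulls a whole dataset onto the driver.
    if ".collect()" in change.diff:
        return ["spark: .collect() may pull the full dataset onto the driver"]
    return []


def size_reviewer(change: Change) -> list[str]:
    # Hypothetical rule: very large diffs go back to be split up.
    if len(change.diff.splitlines()) > 400:
        return ["size: diff exceeds 400 lines"]
    return []


def run_reviews(change: Change, reviewers: list[Reviewer]) -> bool:
    """Collect findings from every reviewer; approve only if there are none."""
    for reviewer in reviewers:
        change.findings.extend(reviewer(change))
    return not change.findings


def should_roll_back(error_rate: float, baseline: float, tolerance: float = 2.0) -> bool:
    """Operational check: intervene if errors exceed `tolerance` times the baseline."""
    return error_rate > baseline * tolerance


if __name__ == "__main__":
    change = Change(author_agent="writer", diff="rows = df.collect()\n")
    approved = run_reviews(change, [spark_reviewer, size_reviewer])
    print("approved:", approved)
    print("findings:", change.findings)
```

In a real system each reviewer would be an agent with its own knowledge base and tools rather than a string match. The point of the sketch is only that review is a separate, composable stage with its own owner per domain.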

The risk is that goal-driven agents may touch internal APIs, generate Flink or Spark workloads users do not understand, or trigger incidents in shared systems. Platform teams therefore need stronger interfaces, clearer boundaries, and defensive mechanisms before automation is allowed to act directly in production.
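One way to picture such a defensive mechanism is a policy gate that every agent action must pass before it runs. The sketch below is an assumption-laden illustration, not a description of any real system: the `AgentAction` fields, the target labels `internal-api` and `shared-cluster`, and the three-way `Decision` are all invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_HUMAN = "needs_human"
    DENY = "deny"


@dataclass(frozen=True)
class AgentAction:
    """What an agent wants to do (hypothetical schema)."""
    agent: str
    target: str        # e.g. "public-api", "internal-api", "shared-cluster"
    environment: str   # "sandbox" or "production"
    mutating: bool     # True if the action changes state


# Hypothetical boundary definitions; a real platform would load these from config.
INTERNAL_TARGETS = frozenset({"internal-api"})
SHARED_TARGETS = frozenset({"shared-cluster"})


def evaluate(action: AgentAction) -> Decision:
    """Apply the boundaries in order, from most to least restrictive."""
    # Internal APIs are off limits to agents entirely.
    if action.target in INTERNAL_TARGETS:
        return Decision.DENY
    # State-changing actions in production require a human in the loop.
    if action.environment == "production" and action.mutating:
        return Decision.NEEDS_HUMAN
    # Shared systems get the same treatment even outside production.
    if action.target in SHARED_TARGETS and action.mutating:
        return Decision.NEEDS_HUMAN
    return Decision.ALLOW


if __name__ == "__main__":
    action = AgentAction("writer", "shared-cluster", "production", mutating=True)
    print(evaluate(action).value)
```

Ordering the checks from deny to allow keeps the default conservative: an action only proceeds unattended when no boundary applies to it.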

What teams can do now

Emma’s near-term advice is to buy back time with support bots, AGENT.md files, skills, isolated trials, and private eval suites. Even a simple eval suite maintained in Notion or an internal document can help teams decide when a new model is ready for a specific workflow.
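To make the eval-suite idea concrete, here is a minimal sketch of a readiness check in Python. Everything in it is hypothetical: the `EvalCase` structure, the 0.9 pass-rate threshold, the example prompts, and the `stub_model` function, which stands in for a call to a real model. The cases could just as easily be copied from a shared document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One private eval: a prompt plus a check on the model's answer."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose check accepts the model's output."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def ready_for_workflow(
    model: Callable[[str], str],
    cases: list[EvalCase],
    threshold: float = 0.9,  # hypothetical bar; each team would pick its own
) -> bool:
    """A model is considered ready only if it clears the team's threshold."""
    return pass_rate(model, cases) >= threshold


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    if "OOM" in prompt:
        return "increase executor memory and retry the job"
    return "unknown"


# Hypothetical job-debugging cases for a single workflow.
JOB_DEBUGGING_SUITE = [
    EvalCase("oom", "Job failed with executor OOM. What next?",
             lambda answer: "memory" in answer),
    EvalCase("lag", "Consumer lag is growing on one partition. What next?",
             lambda answer: "partition" in answer),
]

if __name__ == "__main__":
    print("pass rate:", pass_rate(stub_model, JOB_DEBUGGING_SUITE))
    print("ready:", ready_for_workflow(stub_model, JOB_DEBUGGING_SUITE))
```

Keeping one suite per workflow, rather than one global score, matches the framing above: a model can be ready for support triage well before it is ready for job debugging.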

The strategic signal is clear: agent adoption cannot remain concentrated at the application layer. For productivity gains to last, infrastructure, code review, and operations must become agentic as well, but with much stricter guardrails.
