PODCAST EPISODE 109

OpenAI's Framework for Shipping Code at 70 PRs/Week

Ryan Lopopolo built a product at OpenAI with zero human-written code, and by the time his team reached its seventh engineer, new hires were making the team faster within two weeks.

9 Jun 2026 | 55 min 27 sec

AI Coding Tools

In this episode

Ryan Lopopolo's team at OpenAI shipped an entire product with no human-written code - and onboarding new engineers made the team faster within two weeks. That's not a thought experiment. That's what Harness Engineering looks like in practice.

Ryan is a Member of Technical Staff at OpenAI and the person who coined the term Harness Engineering. In this conversation, recorded live at AI Native DevCon London 2026, he breaks down the systems, constraints, and feedback loops that make coding agents reliable enough to trust - and what it actually takes to get to a billion tokens a day.

What we cover:

What Harness Engineering is, and how it differs from context engineering and prompt engineering
The no-human-written-code constraint: why Ryan imposed it and how it changed the team
Building trust in agent output: Friday garbage collection, anti-slopification CI loops, and reviewer agents
Inverting spec-driven development: why it's easier to build first and distill the spec second
PR throughput going from 3.5 to 70 per engineer per week across model versions
What engineering teams look like when Harness Engineering becomes the default

Tessl: https://tessl.io | Subscribe for weekly episodes on AI-native development

What's your current PR throughput - and do you think Harness Engineering is the missing piece to scaling it? Drop your take in the comments.

Harness Engineering: How OpenAI Ships Code Without Writing It

There is a moment in this conversation where Ryan Lopopolo describes discovering that a teammate had completely replaced the MCP connection in his codebase with a TypeScript daemon - while he was using the codebase, without interrupting his workflow, and without telling him. His reaction: "I found it shocking."

That story captures something important about where software engineering is heading. Lopopolo, a Member of Technical Staff at OpenAI and the person who coined the term Harness Engineering, joined The AI Native Dev at AI Native DevCon London 2026. What followed was one of the most practical breakdowns of production-grade agent development the show has produced.

What is Harness Engineering?

Harness engineering is the discipline of setting up agents to do highly complex, autonomous software engineering work - producing code that is acceptable, trustworthy, and mergeable - by engineering the systems around them rather than the agents themselves.

The core insight is that the only two levers you have when working with a coding agent are context and tools. Harness engineering is a deliberate combination of both. It means getting the right context into the repository where agents can find it, surfacing that context to the agent at the right moment, and closing the feedback loop with reviewer agents, tests, linters, and refactoring cycles that let the agent prove its output meets the bar.
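As a rough illustration of that closed loop, the sketch below runs a candidate change through a set of checks and feeds any failures back to the agent as extra context. It is a minimal sketch, not Lopopolo's implementation: `run_until_green`, the `Agent` and `Check` signatures, and the toy agent are all hypothetical names, and a real harness would call a model and real linters or tests in place of these stubs.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: a real harness would wrap a model call and real tooling.
Agent = Callable[[str, List[str]], str]   # (task, feedback so far) -> candidate code
Check = Callable[[str], Optional[str]]    # candidate -> failure message, or None on pass


def run_until_green(task: str, agent: Agent, checks: List[Check],
                    max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Ask the agent for a candidate, run every check, and feed failures back."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = agent(task, feedback)
        failures = [msg for msg in (check(candidate) for check in checks) if msg]
        if not failures:
            return candidate, attempt      # every check passed
        feedback.extend(failures)          # failures become context for the next try
    return None, max_attempts              # never went green; escalate to a human


# Toy stand-ins so the sketch runs without a model or a linter.
def toy_agent(task: str, feedback: List[str]) -> str:
    if any("docstring" in item for item in feedback):
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
    return "def add(a, b):\n    return a + b\n"


def needs_docstring(code: str) -> Optional[str]:
    return None if '"""' in code else "missing docstring"


def no_tabs(code: str) -> Optional[str]:
    return "tab character found" if "\t" in code else None


candidate, attempts = run_until_green("write add()", toy_agent,
                                      [needs_docstring, no_tabs])
print(attempts)  # prints 2: the first try fails the docstring check
```

The point of the design is that the checks, not a human reviewer, tell the agent what is wrong, so the human only sees work that has already gone green or has run out of attempts.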

Lopopolo noted that this framing draws on his experience leading developer productivity for 350 engineers at Brex. "When I'm trying to improve productivity for an engineering organization of 350 engineers, I can't be hands-on in the weeds of every PR," he explained. "I kind of have to steer from the background, make sure that the right systems are in place by default." With coding agents, the same mental model applies - except the agents are the engineers.

The No-Human-Written-Code Constraint

In June 2025, before GPT-5 had shipped, Lopopolo imposed a constraint on his team: no human-written code. This was, by his own description, a radical idea at the time.

What followed was a period of deliberate friction. Early Codex was far less capable, and Lopopolo found himself acting as what he called a "chunky tool" - a human the agent could delegate to when it got stuck. But the friction was generative. Every time the agent asked him to do the same thing twice, he built a tool call that eliminated the ask. That habit of paying close attention to where his own time was going, and systematically automating it away, became the foundation of harness engineering.

As the team grew, something unexpected happened. Each new hire made the team faster within two weeks - rather than slower for the first few months, as onboarding typically goes. The reason: Codex was the sole entry point to the codebase, so every new hire immediately got the accumulated judgment of the entire team, already embedded in the harness. "They're already there," Lopopolo observed, "which means new hires were able to very quickly supply their best judgment and context."

Building Trust Through Quality Loops

Trust in agent output didn't arrive all at once. Lopopolo described two distinct phases.

The first was trusting the agent to produce code at all. That moment crystallised while he was working on an early version of what became the Codex app: he gave a prompt and saw the feature in the real world half an hour later. "You kind of got this feeling of confidence that it really is down to my capacity for scheduling prompts."

The second phase was trusting that the code was high quality. This arrived by necessity when the team's third engineer joined and PR throughput grew faster than the team could review. Their solution centred on what Lopopolo calls the anti-slopification loop. A Friday "garbage collection" session identified recurring quality issues and systematically eliminated them. The findings then fed into programmatic guardrails and automated CI jobs that could detect the same class of problem and propose fixes going forward. "We never wanted to give the same feedback twice," he noted.

The CI jobs would spider through the codebase, identify divergences from the team's golden principles, and propose PRs. A human would supervise and provide thumbs up or down. The next time the loop ran, it would ingest that feedback, compare it against session logs, and update its own understanding of what good looked like.
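One way to picture that loop is the sketch below. It is heavily simplified and every name in it (`Principle`, `Proposal`, `FeedbackLog`, `scan`) is hypothetical: the "golden principles" are reduced to single-line regex rules, files are an in-memory dict rather than a real checkout, and "learning from feedback" is reduced to remembering which proposals a human rejected. The episode describes something richer that also consults session logs.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Principle:
    name: str
    pattern: str   # regex that flags a divergence on a single line
    advice: str


@dataclass(frozen=True)
class Proposal:
    path: str
    line_no: int
    principle: str
    advice: str


@dataclass
class FeedbackLog:
    # (principle, path) pairs where a human gave a thumbs down
    rejected: Set[Tuple[str, str]] = field(default_factory=set)

    def record(self, proposal: Proposal, approved: bool) -> None:
        if not approved:
            self.rejected.add((proposal.principle, proposal.path))


def scan(files: Dict[str, str], principles: List[Principle],
         feedback: FeedbackLog) -> List[Proposal]:
    """Walk every file and propose a fix wherever a principle is violated."""
    proposals = []
    for path, text in sorted(files.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            for principle in principles:
                if (principle.name, path) in feedback.rejected:
                    continue   # a human already said this is fine here
                if re.search(principle.pattern, line):
                    proposals.append(
                        Proposal(path, line_no, principle.name, principle.advice))
    return proposals


files = {
    "api.py": "def handler(x):\n    print(x)\n    return x\n",
    "cli.py": "print('usage: tool <file>')\n",
}
principles = [Principle("no-print", r"\bprint\(", "use the structured logger")]
log = FeedbackLog()

first_run = scan(files, principles, log)
# Simulated human review: printing is legitimate in the CLI entry point.
for proposal in first_run:
    log.record(proposal, approved=(proposal.path != "cli.py"))

# api.py is still flagged because this sketch never applies the approved fix.
second_run = scan(files, principles, log)
print(len(first_run), len(second_run))  # prints 2 1
```

The thumbs-down on `cli.py` is what changes the second run: the same feedback does not have to be given twice.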

Spec-Driven Development, Inverted

One of the more counterintuitive frameworks Lopopolo described concerns the relationship between specs and code. Traditional spec-driven development says: write the spec, refine it as code gets produced, suss out the ambiguity. His experience building Symphony - OpenAI's "ghost library," a spec that can be implemented in any codebase - suggested the opposite sequence often works better.

"I've actually found it is much easier to produce the code first, to have that code sort of like present a straw man for what the world could look like, engage with that as a team to refine it, and then distill the spec out of that." The reason is that produced artifacts - code, spreadsheets, documents - are information-dense. They contain, implicitly, all the decisions made to produce something the team accepted as good.

For Symphony specifically, the team built a three-phase pipeline: give the implementation to an agent and ask it to produce a spec; give only that spec to a second agent and ask it to implement the system; give both the derived implementation and the original to a third agent acting as judge, which would identify misalignments and propose changes to the spec. Iterate until the spec reliably reproduces the system.
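The three-phase pipeline can be sketched as a loop over three pluggable agents. This is an illustrative sketch only: `distill_spec`, `DistillResult`, and the callable signatures are hypothetical, the `revise` step that applies the judge's proposed changes is an assumption about how the proposals get merged, and the stub agents at the bottom stand in for model calls so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical agent signatures; in practice each would wrap a model call.
SpecWriter = Callable[[str], str]              # implementation -> spec
Implementer = Callable[[str], str]             # spec only -> derived implementation
Judge = Callable[[str, str], List[str]]        # (original, derived) -> proposed spec changes
Reviser = Callable[[str, List[str]], str]      # (spec, changes) -> updated spec


@dataclass
class DistillResult:
    spec: str
    rounds: int
    converged: bool


def distill_spec(original_impl: str, write_spec: SpecWriter, implement: Implementer,
                 judge: Judge, revise: Reviser, max_rounds: int = 5) -> DistillResult:
    """Iterate until the spec alone reproduces the original system."""
    spec = write_spec(original_impl)                 # phase 1: implementation -> spec
    for round_no in range(1, max_rounds + 1):
        derived = implement(spec)                    # phase 2: spec -> fresh implementation
        changes = judge(original_impl, derived)      # phase 3: compare and propose changes
        if not changes:
            return DistillResult(spec, round_no, True)
        spec = revise(spec, changes)
    return DistillResult(spec, max_rounds, False)


# Stub agents: a "system" is just a list of behaviours, one per line.
original = "parse config\nretry on failure\nlog every request"


def lossy_spec_writer(impl: str) -> str:
    return "\n".join(impl.splitlines()[:2])          # first draft misses a behaviour


def literal_implementer(spec: str) -> str:
    return spec                                      # builds exactly what the spec says


def line_diff_judge(orig: str, derived: str) -> List[str]:
    have = set(derived.splitlines())
    return [line for line in orig.splitlines() if line not in have]


def append_changes(spec: str, changes: List[str]) -> str:
    return spec + "\n" + "\n".join(changes)


result = distill_spec(original, lossy_spec_writer, literal_implementer,
                      line_diff_judge, append_changes)
print(result.converged, result.rounds)  # prints True 2
```

Keeping the implementer blind to the original code is the important constraint: anything the second agent gets wrong is, by construction, something the spec failed to say.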

From 3.5 to 70 PRs Per Engineer Per Week

The numbers Lopopolo shared on model progression are striking. At the beginning of the GPT-5.2 era, his team was producing around 3.5 PRs per engineer per week. With GPT-5.5, that figure had reached 70, a twentyfold increase. "That's more than linear scaling," he observed. Each revision of the model compounded on the last, and the team's harness was already in place to absorb each improvement immediately.

That trajectory informs the claim that attracted some controversy: that it is "borderline negligent" not to use a billion tokens a day. Lopopolo's argument is that intelligence extraction scales roughly linearly with token consumption. Getting to a billion tokens a day requires thinking well beyond pair programming - it requires asynchronous loops, parallel agent runs, and automations whose side effects reach across an entire organisation. The harness is what makes that scale possible without producing chaos.
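A back-of-envelope calculation gives a feel for why that volume rules out a single interactive session. The per-session throughput below is a made-up assumption for illustration, not a figure from the episode, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_tokens_per_second(tokens_per_day: float) -> float:
    """Average rate needed if consumption is spread evenly over 24 hours."""
    return tokens_per_day / SECONDS_PER_DAY


def concurrent_sessions_needed(tokens_per_day: float,
                               tokens_per_second_per_session: float) -> int:
    """How many always-on sessions it takes to sustain the daily total."""
    rate = required_tokens_per_second(tokens_per_day)
    return math.ceil(rate / tokens_per_second_per_session)


rate = required_tokens_per_second(1_000_000_000)
# ASSUMPTION: one agent session averages 100 tokens/second (input plus output).
sessions = concurrent_sessions_needed(1_000_000_000, tokens_per_second_per_session=100)
print(round(rate), sessions)  # prints 11574 116
```

Under that assumed rate, a billion tokens a day works out to well over a hundred sessions running around the clock, which is only plausible with asynchronous and parallel runs rather than one engineer pairing with one agent.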

What Engineering Looks Like Next

When asked what an engineering team looks like when harness engineering is the default, Lopopolo described a rebalancing of time rather than a reduction in engineers. His own time had shifted from 50-70% code production to roughly 30% on each of three areas: hard refactors and zero-to-one ideation, customer conversations, and scheduling and staffing work. "I am able to step back a bit and focus on these other highly cross-functional, high priority work streams."

The skill that matters most in that world, he argued, is systems thinking - the ability to peer six months into the future, set teams up for success, and define the interfaces and components of a system rather than its implementation. The engineer who excels is not the one who writes the most code but the one who most clearly sees what the system should become and can communicate that to the agents building it.

That shift is already underway on the teams building at the frontier. The conversation is worth a listen for anyone trying to understand what comes next.
