There is an uncomfortable conversation happening in engineering leadership circles right now, and most organizations are not having it openly enough. It goes roughly like this: we have deployed AI coding assistants across the team, commit velocity has increased, lines of code per engineer are up, pull request volume has climbed, and yet something feels off. The engineers who seem most thoughtful about what they are building are not necessarily the ones driving those metrics. Meanwhile, some of the most active committers are shipping code that requires repeated revision, creates architectural debt, or solves problems that did not need solving in the first place.

This is not a new tension. Engineering leadership has always struggled with the difference between activity and output, and between output and value. But AI coding tools have compressed that struggle into a much sharper and more immediate form. When a tool can generate a working implementation in minutes, the act of writing code stops being the bottleneck. And when writing code stops being the bottleneck, the metrics we built around code production start telling us less and less about who our best engineers actually are.

The question this creates for engineering organizations is genuinely difficult: if AI can generate code, review code, write tests, explain APIs, and implement features, how do you know who your best engineers are? How do you evaluate performance fairly? How do you avoid accidentally rewarding the wrong behaviors at exactly the moment when getting this right matters most?

This article is an attempt to think through that question seriously.

How We Ended Up Measuring What Is Easy to Count

Before examining what metrics should look like in an AI-assisted world, it is worth understanding why the current ones exist and why they became so entrenched.


