System and Engineering Process KPIs

By Vlad Zams. 17 January 2024

1. System KPIs

Surprisingly, there is only one kind of technical KPI that I’ve seen used in practice: SLOs (Service Level Objectives). SLOs have traditionally been used primarily for high-load or highly distributed (microservice-based) backend software. They are typically embodied in an SLA (service level agreement) and are essentially a set of requirements on certain metrics that a backend application should always conform to. An example of one such requirement:

99.9999% of all GET requests coming to the [/emr] endpoint should have a response dispatched within 0.1 seconds, where the moment of “dispatch” refers to writing the output data to a network socket facing outside of our own network (where all of our balancers, caches and services live).
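
A requirement like the one above can be checked mechanically against measured latencies. The following is a minimal sketch in Python, not a production monitoring setup. The names `LatencySlo`, `compliance_ratio`, `meets_slo` and `EMR_GET_SLO` are made up for illustration, and the sketch assumes the dispatch latencies (in seconds) have already been collected somewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    """One SLO clause: `target_ratio` of requests must finish within `threshold_s`."""
    threshold_s: float
    target_ratio: float


def compliance_ratio(latencies_s, threshold_s):
    """Return the fraction of requests whose latency is within the threshold."""
    if not latencies_s:
        raise ValueError("no requests observed")
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s)


def meets_slo(latencies_s, slo):
    """Return True if the observed latencies satisfy the SLO clause."""
    return compliance_ratio(latencies_s, slo.threshold_s) >= slo.target_ratio


# The example clause from the text: 99.9999% of GET /emr within 0.1 seconds.
EMR_GET_SLO = LatencySlo(threshold_s=0.1, target_ratio=0.999999)
```

For example, `compliance_ratio([0.02, 0.05, 0.09, 0.30], 0.1)` gives 0.75, which is far below the target. Note that with a target of 99.9999%, a single slow response in a sample of fewer than a million requests already misses the objective, so the measurement window matters as much as the threshold.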

And, quite obviously, an SLO set includes a number of such requirements, not just one. At the same time, most end users don’t use APIs directly, but only through client applications: mobile apps, browser apps (aka “front-end” apps for desktop or mobile browsers), and standalone desktop apps. What’s surprising here is that I don’t remember seeing anything similar to an SLO/SLA used for a mobile or a front-end app. The only exception I can recall was a requirement related to startup time, e. g. the app should launch or be rendered in a browser window in no more than 2.5 seconds. If that’s the only thing we have for our client-side SLO, then I’d say it’s a pretty shallow one. I can’t remember anything that would resemble an SLO from a usability standpoint. So the questions that arose for me were like:

Could we want at least 50% of our users to be able to complete a certain action within 30 seconds of entering the 1st screen in the flow?
Could we want reaching support to take no more than 3 clicks from at least 80% of the screens of our app?
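
Both questions can be phrased as measurable ratios, much like a backend SLO. Here is a rough sketch in Python. The function names and the input shapes (a list of completion times per user, a mapping from screen name to click count) are assumptions for illustration only, and how such data gets collected from a real app is out of scope.

```python
def share_completed_within(durations_s, limit_s):
    """Share of users who completed the flow within `limit_s` seconds.

    `durations_s` has one entry per user who entered the flow;
    None means the user never completed it.
    """
    if not durations_s:
        raise ValueError("no users entered the flow")
    done = sum(1 for d in durations_s if d is not None and d <= limit_s)
    return done / len(durations_s)


def share_of_screens_within_clicks(clicks_to_support, max_clicks):
    """Share of screens from which support is reachable in `max_clicks` or fewer.

    `clicks_to_support` maps a screen name to the minimum number of clicks
    needed to reach support from that screen.
    """
    if not clicks_to_support:
        raise ValueError("no screens given")
    ok = sum(1 for clicks in clicks_to_support.values() if clicks <= max_clicks)
    return ok / len(clicks_to_support)
```

With completion times `[12.0, 45.0, None, 28.5]` and a 30-second limit, the first function returns 0.5, so a “50% within 30 seconds” objective would be just met.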

I’d say “hell, yeah!”. And what does your experience tell you?

2. Engineering KPIs

Another thing that has often struck me is how blindly certain KPIs of the engineering process itself are selected.

Here “line coverage”, as a metric, deserves a separate mention. Any metric can be misleading if applied continuously without revisiting whether the goals that inspired it are still relevant. Line coverage, however, is a vivid example since it seems to be abused so often. My view is that line coverage correlates with the quality of the software or the engineering process about as much as the number of screws touched during assembly correlates with the quality of a car or of the process of building it. Please don’t get me wrong: in the car example it’s clear there is a positive correlation between screws checked and car (and process) quality. However, that metric seems far less indicative when compared to something like:

How many bugs does a new team member fix on average within the first 2 months of work?
How much time does it take an engineer (on average across the team) to pull the last 2 months of usage data for whatever she or her team is building (e. g. an API service, a GUI-app feature)?
How many suggestions coming from the engineering team are discussed at the product or business level, with the results reported back to the engineering team?
How many suggestions does an engineer come up with on average per month?

I’m not sure how much contrast can be seen here right away. Let me elaborate on a few perspectives by putting some more questions side by side as additional examples:


(1) “What’s our current unit-test coverage?”	(a) “Which screen of the app do experienced users go to most frequently upon launching it?”

Or, if we’re more used to an environment where engineers are expected to focus on technical aspects rather than analyze product aspects, let’s compare these two:


(2) “How many unit tests did we introduce this week?”	(b) “What is the most dependent module/class of our code base?”

All of those questions are about engineering, assuming you accept that question (a), while touching on UX, is still about the outcomes of how the app is engineered.

Now, please take a moment and imagine an engineering leader asking question (2). It doesn’t matter whether it’s a senior engineer, an engineering manager, or a tech lead. What might a team infer when hearing such a question? Even without necessarily noticing it, one of the implications would be, well, “the number of weekly added unit tests matters”. Obviously, that’s at least because somebody with a certain amount of authority is asking that kind of question.
Developing this line of thought, what kind of inspiration does such a question introduce? Well, probably thinking more about unit tests, probably on a weekly basis.

And while adding unit tests may actually improve familiarity with the code base and bug discovery later on, emphasizing how frequently they are written (which, btw, question (2) certainly does) may not lead to these positive results.

But the more important general point I’m making is that many questions, and I’d argue nearly all of the questions asked by people holding higher organizational authority (i. e. formal managers), are leading questions. Therefore, questions about certain activities tend to lead to those activities being repeated or avoided. Questions about a certain state tend to lead toward ensuring that state or changing it.

3. Effective KPIs

(A copy of the questions above, kept nearby for reference)


(1) “What’s our current unit-test coverage?”	(a) “Which screen of the app do experienced users go to most frequently upon launching it?”
(2) “How many unit tests did we introduce this week?”	(b) “What is the most dependent module/class of our code base?”

It’s clear that question (2) is about an activity rather than about a state. Apart from the fact that not every quantity translates into a certain type of quality, asking about the activity rather than the state bears some risk of its own: a number of books on management suggest formulating the desired outcomes of someone’s work rather than instructing them on how to do that work. (Of course, there are some nuances, e. g. certain activities and repetition can deserve much more attention during training. However, right now we’re talking about a regular production process, not dedicated training.)

In that sense question (1) is more suitable because it’s about the state, that is, a particular outcome of the engineering process. That said, question (1) still doesn’t directly inspire or otherwise imply that we as engineers would think about:

What affects the production code
How we interact with each other (within the engineering team)
How we interact with other teams, for instance, how we process input from the users, or the business, how we understand what’s more important, etc.

And here we can see how questions (a) and (b) are practically different. They are both about the state, and they are both about the state of a production system. And those are the questions that actually inspire the engineering process to change meaningfully.

So, a potentially counterintuitive point here is that while, compared to questions (1) and (2), questions (a) and (b) per se are both less about activities performed exclusively by engineers, they imply more of what we may actually want out of the engineering process.

You may fairly argue that finding out which module (or class) is the most dependent doesn’t directly affect the production environment. I get it. To clarify, the point behind that one was to shed more light on the quality of our system, just as we could by asking “what is the least tested code we have?”. And if we imagine different follow-up questions, we may easily see this conversation turning to actionable points that would affect how we deal with those parts of the system. And the answer may be far from “covering it with tests up to X%”.
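
To make question (b) a bit more tangible, here is a small Python sketch. “Most dependent” can be read two ways, so it computes both: how many modules depend on a given module (fan-in) and how many modules a given module depends on (fan-out). The `IMPORTS` graph and the function names are hypothetical; in a real code base the graph would have to be extracted by some tooling, which is not shown here.

```python
from collections import Counter


def fan_in(imports):
    """Count, for each module, how many other modules import it."""
    counts = Counter()
    for module, deps in imports.items():
        for dep in set(deps):
            if dep != module:  # ignore self-references
                counts[dep] += 1
    return counts


def fan_out(imports):
    """Count, for each module, how many other modules it imports."""
    return {module: len(set(deps) - {module}) for module, deps in imports.items()}


# Hypothetical dependency graph: module -> modules it imports.
IMPORTS = {
    "api": {"auth", "models", "utils"},
    "auth": {"models", "utils"},
    "models": {"utils"},
    "utils": set(),
}

most_depended_upon = fan_in(IMPORTS).most_common(1)[0][0]
outgoing = fan_out(IMPORTS)
most_depending = max(outgoing, key=outgoing.get)
```

In this toy graph `utils` has the most dependents and `api` has the most dependencies. Either answer is a natural starting point for the follow-up questions about how to treat that part of the system.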

Btw, I’m not saying that there are no simple-to-measure, purely technical metrics. There may be. For example, I’d say that local build time (which can be even easier to measure than line coverage) has a better correlation with the quality of the engineering process. And it’s much harder to cheat on its value compared to line coverage.
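
As a rough illustration of how little is needed to measure local build time, here is a Python sketch that times a command over a few runs. `time_command` and `BUILD_COMMAND` are made-up names, and the command itself is a placeholder to be replaced with the project’s real build invocation. Whether to time clean or incremental builds is a separate decision this sketch doesn’t make.

```python
import statistics
import subprocess
import sys
import time

# Placeholder command; swap in the project's actual build command.
BUILD_COMMAND = [sys.executable, "-c", "pass"]


def time_command(command, runs=3):
    """Run `command` `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        started = time.perf_counter()
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - started)
    return durations


def median_duration(command, runs=3):
    """Median duration, which is less sensitive to one unusually slow run."""
    return statistics.median(time_command(command, runs))
```

Calling `median_duration(BUILD_COMMAND)` on each engineer’s machine and tracking the value over time would already give a trend to talk about.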

Final Thoughts

The implications of a question (any question) are fundamentally a matter of speculation. A particular team may or may not think one way or another, especially when the contexts are different. E. g. a question asked once a year may still be leading, but not nearly as leading as one asked weekly.

And there is no silver bullet or best-for-all set of KPIs. That is, any set of KPIs can be misused when followed dogmatically, if only because the business and the market landscape change. We just can’t keep up with the game unless we revise what we optimize for. Good luck!

References

Chapter 4 in “First, Break All The Rules” by Marcus Buckingham & Curt Coffman
Chapter 6 in “High Output Management” by Andrew S. Grove

P. S. “EAWF” stands for “exclusively abstract word fog”