Your top performers may actually be hurting your organization

The Borys
9 min read · Dec 5, 2018


Common practices are always a good target for discussion, especially when they concern people management. Let's discuss how local optima, and the best practices built around them, can lead your team into a never-ending disaster.

Isolated improvements are known as “local optima” (or in traditional manufacturing contexts, “local efficiencies”). A local optimum is whatever is best for the performance of an individual part, whereas the global optimum is what is best for the performance of the system as a whole.

The Case: One thinks that resources and people shouldn't be idle, which leads to performance metrics per team or even per individual. The idea can be extended further: if there is extra capacity that is always idle, we should cut it to optimize the system's cost. You may not recognize it, but we are talking about the universal rule of the modern workplace: "stay busy."

Delivery pipeline and organization structure

So, let's imagine an unrealistic but fully deterministic system. (I want to use a software development pipeline as the example, although the variety of real pipelines may reduce the clarity of the explanation.) Here we have three teams that work on very similar projects all the time.

Project machine structure

The backend team makes some changes in the code and passes them to the UI team; then the QA team runs the tests (well, a QA automation team, which also fixes all the bugs it finds, so in our case a project never needs to go back to backend or UI). Each stage has its own quality gates, and they operate inside the team. (We will not discuss how ineffective such a pipeline may be from a DevOps perspective or by any other methodology; that is not today's topic. Let's instead prove or disprove some assumptions about this system and discuss how it behaves.)

Assume the following conditions are true

  • The system is stable
  • Teams have zero attrition (nobody ever leaves)
  • People are happy
  • People can't get sick or burn out
  • Every team's work takes the same amount of time
  • (each change at each stage always takes the same amount of time)
  • No iterations back to a previous team are needed
  • Projects are the same size
  • There is no unpredictable human behavior
  • Each team has robust quality gates
  • Neither people's health nor their mood affects the quality gates
  • Nothing can affect product quality here

With a deterministic simulation, we can highlight processes that would normally stay hidden behind excuses about people and their unpredictable behavior.

Here is the high-level diagram of the pipeline.

Pipeline and its characteristics
  • Note: if a team passes a project to the next team directly (as soon as its own part is done), the next team starts working on it immediately. Everyone still has all the details in mind, so it takes the UI and QA teams zero abstract minutes to pick it up.
  • However, if the UI team takes a project from the backlog, it needs extra time to get into it (add here time spent communicating with managers, re-prioritizing, and refreshing the details). One more simplification: the backend team does not need to join meetings with the UI team when the UI team works from the backlog.

Note: all these numbers are made up. You can experiment with days or weeks in the same way, or plug in any other numbers. I chose these for simplicity.

So, I propose to simulate the work in a very straightforward way.

Pipeline simulation

Hopefully, you are not afraid of a pre-filled table of numbers. Let me explain them. (TIP: this may require extra attention, but you can skim the highlighted text and come back to the numbers later.)

  • Let's say we start at 00:00 (MM:SS; the Time column is on the left) and nothing is happening. Everyone is idle, so I fill in "0" at all stages.
  • At 00:01 (the very first second) we get the very first project, and the backend team starts working on it. I mark it as "1" (project #1). The other teams are waiting for the backend team to finish, so their columns are filled with zeros (read: idle).
  • A minute later (01:01), the first team finishes its part. This means the UI team can start working on project #1, while the backend team starts a new project (#2). The QA team is still waiting. Once UI completes project #1, all teams will be busy.
  • At 02:01, the QA team starts working.
  • At 03:01, we get the very first project done! From that moment on, the whole pipeline is busy, as can be seen in the table.
  • As a result, at 04:01 we finish the second project (exactly a minute after the first one). So, after an initial warm-up phase, the pipeline produces one project per minute.
  • We need 3 minutes to finish a project (the time until all teams have done their part). This time is called Lead Time (LT).
  • The pipeline also produces one project per minute. This is called Throughput (T).

LT = 3 minutes = Backend + UI + Tests = 1m + 1m + 1m

T = 1 project/minute = max(Backend, UI, Test) = max(1m, 1m, 1m)
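If you prefer code to tables, here is a minimal sketch of this deterministic pipeline in Python. It is not the table above, just my own toy model of it: the 60-second work slots, the 15-second backlog pickup, and all the function and variable names are illustrative assumptions.

```python
WORK = 60      # seconds each team spends on a project
PICKUP = 15    # extra seconds needed to pull a project out of the backlog

def simulate(n_projects=12, outage_project=None, outage_delay=0):
    """Push projects through Backend -> UI -> QA.

    Returns (lead_times, delivery_times, busy), where busy maps a team name
    to its list of (start, end) working intervals, in seconds from t = 0
    (t = 0 corresponds to 00:01 in the tables above).
    """
    free = {"backend": 0, "ui": 0, "qa": 0}     # when each team is next free
    busy = {"backend": [], "ui": [], "qa": []}  # actual working intervals
    lead_times, delivery_times = [], []

    for p in range(1, n_projects + 1):
        # Backend always has a fresh project waiting, so it never idles.
        b_start = free["backend"]
        b_end = b_start + WORK
        free["backend"] = b_end
        busy["backend"].append((b_start, b_end))

        # UI: a direct handoff costs nothing, but a project that had to wait
        # in the backlog costs PICKUP extra seconds to get into.
        direct_handoff = free["ui"] <= b_end
        u_start = b_end if direct_handoff else free["ui"] + PICKUP
        u_end = u_start + WORK
        if p == outage_project:
            u_end += outage_delay               # the one-off outage
        free["ui"] = u_end
        busy["ui"].append((u_start, u_end))

        # QA: in this scenario QA is always waiting when UI finishes,
        # so its handoff is always direct.
        q_start = max(u_end, free["qa"])
        q_end = q_start + WORK
        free["qa"] = q_end
        busy["qa"].append((q_start, q_end))

        lead_times.append(q_end - b_start)
        delivery_times.append(q_end)

    return lead_times, delivery_times, busy

# Baseline, no outage: lead time settles at 180 s (3 minutes) and deliveries
# arrive every 60 s, i.e. throughput is one project per minute.
lead, done, _ = simulate()
print(lead[:4])                                      # [180, 180, 180, 180]
print([b - a for a, b in zip(done, done[1:])][:3])   # [60, 60, 60]
```

Running it with no outage reproduces the numbers above: LT = 3 minutes and T = one project per minute.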

This is amazing. Look: there is no idle time, every part is doing useful work, and each part works at its maximum speed. Quality gates are robust, people are happy, nobody leaves, all of it. It is time to find other places to optimize, isn't it?

Stress testing the delivery pipeline

I want to simulate an outage here. Just one; I do not need more than that in this case. Say the creator of this system forgot to consider future weather conditions: an unexpected hurricane hits the UI/UX team's location and knocks out the building's electricity for 10 seconds. That will be enough to expose how such a pipeline behaves.

TIPs (because reading numbers is not as much fun as a ready-to-consume conclusion):

  • The main takeaway of the next simulation: once a team starts using the backlog, there is a point of no return, and the system collapses to worse characteristics.
  • The backlog comes into play because the backend team was not affected by the outage.
  • Eventually, the outage makes you question whether your teams and the system are working well.
  • You can read the final chapter first and come back to the numbers afterward, or go through the simulation now if you want to answer those questions yourself.
  • The outage happened between 05:01 and 06:01.
  • The UI/UX team spent an extra 10 seconds recovering from the outage.
  • So, a minute later (06:01) the team is delayed and still working on project #5. Meanwhile, the backend team did its job well and started working on project #7, putting project #6 into the backlog because the UI/UX team is still busy with project #5. (The QA team, meanwhile, has finished project #4 and starts waiting for the UI/UX team to finish #5.)
  • Ten seconds later (at 06:11), the UI/UX team finishes project #5 and passes it to the QA team. At that moment, they have to take project #6 from the backlog. Yup, another 15 seconds.
  • After that extra 15 seconds, at 06:26, the UI/UX team is ready to start working on project #6. Meanwhile, the other teams are working on projects #7 and #5.

At 07:01, the backend team finishes project #7, starts the next project (#8), and puts #7 into the backlog for the UI/UX team. The QA team, however, is still working on project #5, because it inherited the UI/UX team's delay in the previous round.

At 07:11, we deliver project #5, the one affected by the outage! It took a bit more time though: Lead Time = 3m 10s, which is longer than before by exactly the duration of the outage, 10 seconds. Throughput is affected by the same 10 seconds as well. Once the QA team finishes project #5, it again has to wait for the UI/UX team to finish the next project. It already looks like a pattern.

At 07:26, after a short break for the QA team, the pipeline is fully loaded again: each team is working on something. Backend is on project #8, UI/UX is picking up #7 from the backlog, and QA starts work on #6. Let's simulate it a little further.

Let's jump to 08:26. The system finishes project #6. Its LT is 3m 25s (the backend team started it at 05:01) and T = 1m 15s. This is the result of the inefficient work the UI/UX team does when it takes projects from the backlog.

At 09:41, the QA team finishes project #7. The lead time of project #7 is 3m 40s! I bet you can figure out why.

The most interesting thing happens at 10:01, when the backend team puts a second item into the backlog! (I'd call that the point of no return.)

I stopped the simulation at that point because there is no hope the system will recover its initial performance. Things will only get worse if no one intervenes.
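For the curious, the toy model from the earlier sketch shows this "no hope of recovery" behavior in a couple of lines (again, an illustration under my assumptions, not the article's exact table):

```python
# Re-running the sketch above with the single 10-second outage hitting the
# UI/UX team on project #5 (simulate() comes from the earlier snippet).
lead, done, _ = simulate(outage_project=5, outage_delay=10)
print(lead[:8])
# [180, 180, 180, 180, 190, 205, 220, 235]   <- grows by 15 s per project
print([b - a for a, b in zip(done, done[1:])][:7])
# [60, 60, 60, 70, 75, 75, 75]               <- one delivery every 75 s, not 60
```

Every project after the outage pays the 15-second backlog pickup, so lead time keeps growing by 15 seconds per project while deliveries settle at one every 75 seconds instead of one per minute.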

How to manage a team

The most interesting part of this, to me, is how people manage such situations. In reality, most of us do not have such detailed statistics. We usually have to rely on gut feeling and some standard methods of managing an organization. A manager may also be new and not know how the system operated before, or may be responsible for only a part of the organization. The methodologies themselves can be just a reflection of people's mental models and previous decisions; many standard approaches draw their inspiration from past actions. Because of all this, commonly accepted decisions can be wrong or biased by exactly these factors. Let's go through the standard methods and see what optimizations they would suggest.

  • In the table, I highlighted with markers the work our teams did after the outage.
  • Red is the outage itself. It is unplanned work.
  • Grey is when there was no work: a team is starving from a lack of meaningful work (or being lazy).
  • Yellow is non-optimized work, for whatever reason. In our case, the UI/UX team worked from the backlog, which added an extra 15 seconds each time.
  • Green is useful work.

So, let's sum up the analysis:

  • The backend team is our best performer.
  • The UI/UX team is not that good, but it is working hard.
  • The QA team, however, is idle 25% of the time…

So, what are the common decisions here? Well, the funny version is to hire an agile coach for the UI/UX team and to cut the QA team, or to find them something else to do.
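To make it concrete, here is a hedged sketch of the kind of per-team report such decisions are usually based on, computed from the toy model above. The measurement window and the formatting are my choices; the article's table is the real source.

```python
# A per-team "utilization report" built from the toy model above. The window
# (06:01..08:41 on the article's clock, i.e. seconds 360..520 here) and the
# rounding are my choices; shift the window and the percentages move a little.
def busy_fraction(intervals, start, end):
    """Fraction of [start, end) a team spent actually working."""
    worked = sum(max(0, min(b, end) - max(a, start)) for a, b in intervals)
    return worked / (end - start)

_, _, busy = simulate(outage_project=5, outage_delay=10)
window = (360, 520)
for team in ("backend", "ui", "qa"):
    print(team, f"{busy_fraction(busy[team], *window):.0%}")
# backend 100%   <- the "best performer", quietly filling the backlog
# ui      81%    <- the missing time is the 15-second backlog pickups (yellow)
# qa      75%    <- idle a quarter of the time, through no fault of its own
```

The backend team scores a perfect 100% precisely because it keeps producing work that nobody downstream can absorb, which is exactly the point of the next section.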

Reflecting.

We may all be living in the same situation, in a post-outage system. Most likely, all of you have the same problem with the backlog. Moreover, you do see well-performing teams, but they may not be helping you or your organization that much. The root problem is that we do not question whether "useful" work may be harming us. We do not question whether our best performers are the reason our system produces less than it could by design. In reality, you can find hundreds of scenarios where people help spread the side effects of an outage. For example, the QA team may not show you its idle time. People are not stupid: if they can work less, they may start working less diligently, and you will not catch it with statistics or intuition. Alternatively, they can fall under Parkinson's law and start procrastinating unconsciously.

So far, I can offer only one solution based on my experience: try to organize your system so that people do what is needed at the very moment, not merely what they are able to do or what they signed up for. Only cross-team collaboration can align the flow here. Just a little help from the QA or backend team during the outage would have made the system ideal again.

This is why the DevOps approach is so effective, by the way.
