Roll Forward

You deployed the latest release of that application yesterday. Maybe it was painful. Took longer than it should have because some other system was down and you couldn’t fully deploy yours. It happens. You got through it, and there’s a shiny new version out there for your users. Perhaps you even celebrated because it was the end of a long, hard road.

Today is a new day. The first email comes in before noon. Something is amiss. Seems minor, but you look into it anyway. As you dive deeper through layers of systems and configuration and source code, you feel a sense of dread. It’s a defect. Not just some minor bug that there’s a workaround for. No, this is a straight-up fundamental flaw in the system. And your team introduced it.

Step 1. Do not panic. There is nothing so bad that panic cannot make it worse. Your stress level is probably at an 11. Get it under control. Do not freak out. Remain calm. Do not blame it on anyone. You can do a postmortem later and everyone can learn from it, but don’t lash out. Even if you can. Especially if you can. Own the problem. This wasn’t the junior developer’s fault or the QA engineer’s fault. The team did this, and the team will fix it.

Step 2. Communicate. Not a panicked email to your manager or your subordinates or your users. Get the important people in the room or on the line. Do you need manual test support? You should have automated tests, but maybe you want extra eyes on this. Get the QA lead. Code change? Get your best developers pair programming and looking at the code. Does someone have to deploy it? You should have push-button deployments, but if you don’t, get the devops team ready. Change process? Well, god help you, but get someone on it. Depending on the nature of the problem, the users might need to know.

Step 3. Find the root of the problem. Maybe it’s one line, maybe a class, maybe a subsystem. Can you patch it? If not, you have to roll back. It sucks and it makes the whole situation a lot harder to swallow, but you have no choice. Production needs working code and you don’t have it. If you can patch, great. You’re not rolling back; you’re rolling forward. Bonus points for writing a test that demonstrates the defect so you can confirm the fix.
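If you do roll forward, that regression test is worth the extra few minutes: run it before the fix to watch it fail, then after the fix to prove the defect is gone. Here is a minimal sketch in Python with pytest; the `calculate_invoice_total` function and the rounding defect are purely hypothetical stand-ins for whatever actually broke in your system.

```python
# Sketch of a regression test that pins down a (hypothetical) defect.
# Red before the fix, green after it.
from decimal import Decimal


def calculate_invoice_total(line_items):
    """Hypothetical fixed implementation: accumulates exact decimals
    instead of binary floats, which was the (invented) root cause."""
    return sum(
        (Decimal(str(price)) * quantity for price, quantity in line_items),
        Decimal("0"),
    )


def test_large_invoices_do_not_drift():
    # The (hypothetical) defect: float accumulation drifted on big invoices.
    line_items = [(0.10, 1)] * 1000  # 1,000 items at $0.10 each
    assert calculate_invoice_total(line_items) == Decimal("100.00")
```

The test is the artifact that outlives the incident: it documents exactly what went wrong and guards against the same regression sneaking back in later.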

Step 4. Get it done. Once the team is convinced they have a fix in place or that a rollback is needed, it’s time to get moving. If you have good system hygiene, this will be easy. Commit, pipeline, done. If not, well, you might have to move mountains and take some risks. You’re a professional. Know the trade-offs and make good choices. Get it through your environments and get it tested. Hopefully you have automated tests.
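For what “commit, pipeline, done” means in practice, here is a rough sketch of promoting a fix through environments with a smoke-test gate at each stage. The environment names, deploy.sh, and smoke_test.sh are placeholders, not real tools; substitute whatever your pipeline actually runs.

```python
#!/usr/bin/env python3
"""Rough sketch: promote a hotfix through environments, gated by smoke tests.
deploy.sh and smoke_test.sh are placeholder commands, not real tools."""
import subprocess
import sys

ENVIRONMENTS = ["dev", "staging", "production"]


def run(cmd):
    """Run a command and report success; missing placeholder scripts count as failure."""
    print("$", " ".join(cmd))
    try:
        return subprocess.run(cmd).returncode == 0
    except FileNotFoundError:
        return False


def promote(version):
    for env in ENVIRONMENTS:
        if not run(["./deploy.sh", env, version]):
            sys.exit(f"deploy to {env} failed; stop and reassess")
        if not run(["./smoke_test.sh", env]):
            sys.exit(f"smoke tests failed in {env}; do not promote further")
    print(f"{version} is live everywhere")


if __name__ == "__main__":
    promote(sys.argv[1] if len(sys.argv) > 1 else "hotfix")
```

The point of the gate is the same whether it lives in a script like this or in a real CI system: the fix only moves to the next environment after the previous one proves it works.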

Step 5. Follow up. The system failed. Not the system you deployed; the system that allows you to build systems. Somehow a major defect got released, and you need to learn from that. This is still not the time to assign blame. Mistakes happen, and you only fail when you don’t learn from them. Keep some perspective and some balance. This probably isn’t something that requires heavy process that will only slow you down. Write it up. Determine the most effective countermeasure that would catch this problem (again, there are usually trade-offs between time and effectiveness).