Be part of the solution, not part of the problem
As a custodian of a door key, your business values stem from enabling access, rather than restricting entry.
We went through a terrible outage last Saturday. It was caused by a third party, but I am responsible for trusting them. I shared with my team a couple of lessons learned from the incident. Nothing is really new, but I thought it might be of interest to others.
Comments are welcome and appreciated!
One of our clients hosted a mission-critical system on a domestic cloud provider and rented several leased lines linking to their on-prem facility.
A month ago, the cloud provider arbitrarily changed our production network without any notice, causing all leased lines to go down for 45 minutes.
This weekend, a problem in their system killed the Internet connection from our system cluster for 7 hours. Incidents are unwanted, but the way they handled this one was the last straw.
When we detected the problem, we immediately set up a war room and invited the provider in to coordinate troubleshooting. I wanted them to share what happened and provide timely updates on what they were doing.
It took them hours to gather the right people. When they finally joined the room, they didn’t want to share any information, and continued to take action behind our backs. As a result, not only did they fail to fix the Internet line, but they also managed to disrupt the leased lines. Had they told us that they would shut down our lines, we would have had a chance to prevent another 45 minutes of downtime.
Once again, they made changes to our production network without telling us, even when we were in the same meeting room with them. I couldn't believe how low the bar could go for this provider, which is supposed to be the best in Vietnam. How can we trust them when they are always part of the problem? No, we can’t. We’ll move.
Each incident is a learning opportunity, so what else have we learned? Here are a couple of notes I took during the incident; hopefully you'll find them useful.
Notify all stakeholders before making any production changes
When a problem occurs, your notification becomes part of the solution. If no one knows what you did, of course you become part of the problem.
Give people a chance to criticize your plan when it’s still on paper. If you think that conversation is too difficult and uncomfortable, imagine how much worse it will be after you have caused a massive outage.
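One way to make this rule hard to skip is to put it in front of the change itself. The following is only a minimal sketch, not something we run in production: `ChangeRequest`, its fields, and the stakeholder names are all hypothetical, and "notified" here is just a flag you set after you have actually told someone.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A planned production change and the people who must hear about it first."""
    summary: str
    stakeholders: set
    notified: set = field(default_factory=set)

    def notify(self, stakeholder: str) -> None:
        # Record that this stakeholder has been told about the plan.
        if stakeholder not in self.stakeholders:
            raise ValueError(f"unknown stakeholder: {stakeholder}")
        self.notified.add(stakeholder)

    def missing(self) -> set:
        # Stakeholders who have not been told yet.
        return self.stakeholders - self.notified

    def apply(self, action):
        # Refuse to run the change while anyone is still in the dark.
        missing = self.missing()
        if missing:
            raise PermissionError(f"not yet notified: {sorted(missing)}")
        return action()


if __name__ == "__main__":
    cr = ChangeRequest("reroute leased lines", {"client-ops", "network-team"})
    cr.notify("client-ops")
    try:
        cr.apply(lambda: "applied")
    except PermissionError as err:
        print(err)  # network-team has not been notified yet
    cr.notify("network-team")
    print(cr.apply(lambda: "applied"))
```

The point of the design is that the refusal happens before the change, when the conversation is still cheap.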
Overcommunicate
When there is a problem, you should proactively provide as much information as possible, because information is always part of the solution.
There’s no such thing as overcommunicating. You want to boost the communications bandwidth between you and the stakeholders, and ask everyone to share as much information as possible. This is why we created a war room, and called for all hands on deck.
You can't stay silent and secretly try to fix it on your own. The more urgent the situation, the easier it is to make the wrong decision. Sharing what you plan to do allows others to cross-check for you. This is for your own benefit.
Even if the problem is your fault, you still have to widely announce your actions to avoid stepping on each other's toes. The last thing you want is another incident on top of the original incident. Don't let error pile on error.
Document critical production systems
You can't hold everything in your head, because you'll be sick, on leave, run over by a bus, or simply forget what you thought you'd remember forever, and then you become part of the problem.
I have seen very senior engineers make the mistake of not sharing information. They thought that would make them more important and improve their job security. Nothing could be further from the truth.
They failed to understand that by hoarding information they’d become a single point of failure, which is a risk from the business point of view. Strong business leaders would have no choice but to eliminate said risk.
As a custodian of a door key, your business values stem from enabling access, rather than restricting entry.
Make yourself accountable
Accountability can mean different things to different people. For me, it means that if something went wrong, it should be clear who should own and fix it.
Counterintuitively, making yourself accountable — in some sense making it easier for people to blame you — actually helps you become part of the solution.
These days we hear about DevOps practices such as GitOps, Infrastructure as Code, etc. In addition to automating system management, these practices reduce errors by eliminating unaccountable changes.
When there's a permanent record of every change, it's harder to sweep errors under the rug. When people cannot hide their mistakes, they will be more careful and coordinate better with others when making changes.
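To make "permanent record" concrete, here is a toy sketch of a tamper-evident change log, in the same spirit as the way Git commits reference their parents. It is an illustration under stated assumptions, not how any particular GitOps tool works: the `ChangeLog` class and its field names are hypothetical, and a real setup would simply use Git history.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(body: dict) -> str:
    # Hash a canonical JSON encoding so the same body always gives the same digest.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class ChangeLog:
    """Append-only log in which each entry commits to the one before it."""
    entries: list = field(default_factory=list)

    def record(self, author: str, description: str) -> dict:
        # Link the new entry to the hash of the previous one (or zeros for the first).
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"author": author, "description": description, "prev": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute every hash and check every link; any edit breaks the chain.
        prev_hash = "0" * 64
        for entry in self.entries:
            body = {k: entry[k] for k in ("author", "description", "prev")}
            if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


if __name__ == "__main__":
    log = ChangeLog()
    log.record("alice", "change routing on leased line A")
    log.record("bob", "roll back routing on leased line A")
    print(log.verify())
    log.entries[0]["description"] = "nothing happened"  # try to rewrite history
    print(log.verify())
```

Because every entry names its author and is chained to the previous one, quietly editing an old entry is detectable, which is the property that makes changes accountable.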
Writing postmortems also helps reduce errors by increasing accountability. An incident is not over when the immediate problems are fixed. It is over only when the root causes are identified and addressed, which won’t happen unless people are empowered to take accountability.
While postmortems empower people to own problems, they have to be blameless to be effective. On the surface, blameless (do not blame people) and accountable (who should own this problem) seem mutually exclusive, but deep down they are both required to get problems fixed.
As a leader, you don’t want to blame your team, and you don’t want your people to blame each other; otherwise you will never get the truth of what actually happened. Make it clear to everybody that you’re responsible for failures, which is always true. At the same time, you can't have a complete absence of accountability; otherwise nothing will ever get done. This is the essence of radical candor.
At Calif, we think it’s okay to make mistakes, but it’s not okay not to learn from them. Learning starts with owning. Even though the cloud provider caused these recent problems, it’s my responsibility. I trusted them, and I didn’t direct resources into designing a more resilient system. This is my mistake, and I will fix it.
Many people are afraid of the word “responsible”. They worry that if they took responsibility for failures, they would be blamed for being part of the problem. They got it so wrong. Business leaders want their problems fixed first and foremost. Assigning blame doesn’t help them achieve that goal. Whoever takes responsibility and owns problems immediately becomes part of the solution, from the business point of view.
You need to take 100% responsibility for yourself. Working with ISPs in Vietnam is a unique experience.
Notify all stakeholders before making any production changes
-> Even so, they will fail to notify you. You need to maintain a good relationship with your ISPs.
Overcommunicate
-> Don't waste your time listening to the ISP's issues. You had no idea what they were doing. Just focus on when the problem will be solved, and work out a backup solution yourself.
The most significant thing is that you don't have a strong connection with the ISP's most senior engineers. As a result, you can't fully understand what occurred.
-> You have a problem with your design. Don't rely on only one ISP at a time; using multiple ISPs is the best choice, since each of them has different advantages.
After following along, I also realized that sometimes my own hot temper, my bluntness, and my habit of assigning petty blame are part of the problem too.
People in a workplace may tend to complain about every problem, but they can also see it as a shared problem that the whole team has to face.
I have only just started working, and the way I see it, people hire me to solve problems, not to create them.
Thank you for the post, which adds not only to the technical side but also to the mindset of working together.
These are just the lessons I have drawn for myself.