12 Comments
Oct 19, 2023·Liked by Thai Duong

You need to take 100% responsibility for yourself. Working with ISPs in Vietnam is a unique experience.

Notify all stakeholders before making any production changes

-> Even so, they will fail to notify you. You need to keep a positive connection with your ISPs.

Overcommunicate

-> Don't waste your time listening to the ISP's issues. You had no idea what they were doing. Just focus on when the problem will be solved and work out the backup solution yourself.

The most significant thing is that you don't have a strong connection with the ISPs' most senior engineers. As a result, you can't completely understand what occurred.

-> You have a problem with your design. Don't rely on only one ISP at a time; using multiple ISPs is the best choice, since each has different advantages.

author

Thanks for the wise words, appreciate it!

After following this, I also realized that sometimes my own hot temper, bluntness, or tendency to cast petty blame is part of the problem too.

People in a workplace may tend to complain about every problem, but they can also recognize it as a shared problem that the whole team has to face.

I have only just started working, and the way I see it, people hire me to solve problems, not to create them.

Thank you for the post, which added not only to my technical knowledge but also to my mindset about working together.

These are just lessons I've drawn for myself.

Really like your way of sharing🫡🫡

Thank you for sharing. I learned something from it!

Oct 25, 2023·edited Oct 25, 2023

Completely agree with YvYlynk. You have to take 100% responsibility. Even the most trusted cloud service providers, such as AWS and GCP, still have outages, so you should implement a proper design to mitigate the risks caused by a service provider, or build your own business continuity plan (BCP). It's unbelievable that a mission-critical system relies completely on just one service provider without any backup or emergency plan.

A war room should be used to control your actions under that backup plan, not to ask the service provider to share their internal information with you. Even if they did, you might not be familiar with their system, so explaining it would waste their precious time. And I doubt any service provider would do that kind of thing.

I understand that you were frustrated. That's because when the outage comes, there's nothing you can do but wait for the service provider to fix their problem. You take responsibility for trusting them. But trusting them didn't cause those 7 hours of downtime. It was your bad design, no BCP, no risk management... it all comes down to poor management. Sorry, but I think you still lack some foundations of IT management.

author
Oct 25, 2023·edited Oct 25, 2023

Please. I helped design and audit some of the largest systems on Earth. I was a senior staff software engineer at Google, and not many engineers get to that level without a deep understanding of large-scale system design.

You made many assumptions about me and our situation. We are well aware of this single point of failure, but there were higher priority problems to address first.

Edit: typos

Take it easy, man. You might be a technical expert, but you still lack basic management knowledge and sense. A single point of failure is a technical problem. It's OK that you had to prioritize other things. It happens all the time. But you didn't register it as a serious risk for a mission-critical system prior to the incident. You had no way to reduce that risk, nor any plan to bring the system back into operation by yourself. You only took risk mitigation actions after the incident, because you now know that the risk associated with that mission-critical system should have had higher priority than other things. That's poor management. And that is room for improvement.

Trying to blame the service provider doesn't make you a better IT manager, nor does it reduce your accountability and responsibility for the incident. Blaming others and taking responsibility only for trusting them is not an attitude of improvement. You might still end up in the same situation with other service providers.

Just as you said, I'm not in your shoes. I might not have as much technical expertise as you, but from my management point of view, your management of that mission-critical system still has a lot of room for improvement.

author

I'm not sure what your goal is. Do you want to make me feel bad about myself?

I'm kinda used to this kind of criticism. I've heard repeatedly over the years from random Internet persons like you that I was just a technician with bad takes on politics and whatnot.

Maybe try something new?

Oct 26, 2023·Liked by Thai Duong

Sorry if my words were too harsh. I didn't mean to offend. I just don't agree with some of your points. It's OK to make mistakes, but we must take the right lessons from them.

In the end, no one can do everything. If you don't have enough management experience or sense, learn it from others. If you don't have a gift for that field, just let someone who can do it better handle it for you, so you can focus on what you are good at.

Just my thought.

author

All good.

Thanks for the advice.

I'm a stubborn person who has always tried to go beyond my ability. I mostly failed, but sometimes I succeeded.

“As a leader, you don’t want to blame your team and you don’t want your people to blame each other,”

- Some leaders always blame their staff and the people around them, which is what leads to this situation:

“Many people are afraid of the word “responsible”. They worry that if they took responsibility for failures, they would be blamed for being part of the problem”
