How effective an IT engineer is depends heavily on their troubleshooting skills. This post is dedicated to everyone who wants to start building this skill, or to improve an existing one with a few hints and tips.
Troubleshooting basics
If we leave intuition out of the process, the art of troubleshooting comes down to knowledge of the system and a simple workflow. To troubleshoot a system composed of more than one component, you first need to identify the failed component. That is where knowledge of the system becomes necessary: to troubleshoot successfully, you need to understand how its components are tied together. Once you have built a mental picture of the system as a chain of components, you can start examining them one by one to find the one that failed.
Take clear notes of what ideas you had, which tests you ran, and the results you saw.
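The "examine components one by one and take notes" workflow can be sketched roughly as follows. This is only an illustration: the component names and the health checks are hypothetical stand-ins, and real checks would probe real services.

```python
from typing import Callable, List, Optional, Tuple

# A check is any callable that returns True when the component looks healthy.
Check = Callable[[], bool]


def find_failed_component(
    chain: List[Tuple[str, Check]], notes: List[str]
) -> Optional[str]:
    """Examine components in order and return the first one whose check fails."""
    for name, check in chain:
        healthy = check()
        # Record every test and its result, so the notes can be reviewed later.
        notes.append(f"checked {name}: {'ok' if healthy else 'FAILED'}")
        if not healthy:
            return name
    return None  # every check passed


# Hypothetical chain, ordered from the client side towards the backend.
example_chain = [
    ("load balancer", lambda: True),
    ("web server", lambda: True),
    ("database", lambda: False),  # simulated failure
]
example_notes: List[str] = []
print(find_failed_component(example_chain, example_notes))  # prints "database"
print("\n".join(example_notes))
```

Keeping the notes in a plain list makes it easy to paste them into a ticket or a postmortem afterwards.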
Before you begin
- Make sure you have a clear way to reproduce the error or to check whether it is gone. If you don’t have access to part of the System’s functionality to check it yourself (say, for security reasons), make sure you have direct contact with a person who can confirm that the problem is gone. Collect error messages (if available).
- Verify that no planned maintenance is in progress on any of the System’s components (or on services the System depends on). Troubleshooting during such periods can make things much worse.
- Initiate the Incident Response Process, at least the “Announcement” part, so that no changes are made to the System’s components without your approval. This process should also establish clear guidelines for cross-team cooperation, so that multiple engineers do not try to fix the problem at the same time (which also makes things worse).
- Make sure you know what the SLA for the System is. It affects your risk appetite and determines how much time you can spend on understanding the Root Cause.
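To get a feel for how an SLA translates into time, here is a small worked calculation. It assumes the SLA is expressed as an availability percentage over a fixed period (30 days here); your SLA may be defined differently, so treat the function and its parameters as an illustrative sketch.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


# A 99.9% SLA over 30 days leaves roughly 43 minutes of downtime in total.
print(round(allowed_downtime_minutes(99.9), 1))  # prints 43.2
```

With a budget that small, restoring service first and digging into the root cause afterwards is often the safer order.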
Problem classification
Common problems Sysadmins/SREs have to deal with are:
- Availability
- Capacity
- Performance
- Unexpected behavior or Errors
Troubleshooting check-list
- Review monitoring notifications.
- Review maintenance notifications.
- Review recent changes to the System in question (remember, though, that correlation is not ALWAYS causation).
- Examine system metrics (CPU, disk, memory).
- Examine the application log.
- Examine network connectivity (from the client to the impacted system, and from the impacted system to its dependencies).
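The last item of the check-list, connectivity from the impacted system to its dependencies, can be checked with a few lines of Python run on the impacted host. This is a minimal sketch that only tests whether a TCP connection can be opened; the helper names are made up for illustration, and a successful connection does not prove the service behind the port is healthy.

```python
import socket
from typing import Dict, Iterable, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_dependencies(
    dependencies: Iterable[Tuple[str, int]], timeout: float = 3.0
) -> Dict[str, bool]:
    """Map each "host:port" to whether it is reachable from this machine."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in dependencies
    }
```

For example, `check_dependencies([("db.example.internal", 5432)])` would report whether a (hypothetical) database host accepts connections on port 5432.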
Localize the problem / Isolate failure
A vital troubleshooting skill is identifying the failed component. This is especially important for complex (consisting of more than two components) and mission-critical systems. A properly set up monitoring system, with a comprehensive dashboard of system metrics, can greatly reduce your Mean Time To Recover (MTTR) by pointing to the failed component.
Pro-tip:
Unless the failed component is apparent, and provided you have a sound understanding of component inter-dependencies, you can use a binary-search approach instead of examining components one by one: test the middle of the chain first, and the result tells you which half contains the failure. This reduces the time needed to identify the failed component.
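A rough sketch of the binary-search idea is shown below. It rests on assumptions that may not hold for your system: the components form a linear chain, there is a single break point, and you have some probe (here the hypothetical `works_up_to`) that tells you whether the path through the first part of the chain behaves correctly.

```python
from typing import Callable, Optional, Sequence


def bisect_failed_component(
    components: Sequence[str],
    works_up_to: Callable[[int], bool],
) -> Optional[str]:
    """Find the first component at which the chain stops working.

    works_up_to(i) is a hypothetical probe that returns True when the path
    through components[0..i] behaves correctly. The search assumes a single
    break point: probes before it pass, probes at or after it fail.
    """
    low, high = 0, len(components)  # the break point lies in [low, high]
    while low < high:
        mid = (low + high) // 2
        if works_up_to(mid):
            low = mid + 1  # everything up to mid is fine, look further down
        else:
            high = mid  # the failure is at mid or earlier
    return components[low] if low < len(components) else None


# Simulated chain in which "app" (index 3) is broken.
chain = ["dns", "load balancer", "web", "app", "db"]
print(bisect_failed_component(chain, lambda i: i < 3))  # prints "app"
```

For a chain of n components this needs about log2(n) probes instead of up to n, which matters most when each probe is slow or manual.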
Observe system behaviour
Understand failure modes
In what ways can the system fail?
Postmortem
Even if nobody but you will read your Postmortem report (Incident Report, Lessons Learned, etc.), still write it down! First, it helps you reflect in a structured way on the steps you took to troubleshoot the error. Second, it is a good skill to develop for anyone who wants to grow as a System Administrator/SRE.
A good postmortem should be blameless, but still honest. You don’t want to point fingers at anyone; otherwise you will end up with a culture of covering up risks and failures, which will make your systems less stable. Yet it should expose all mistakes and hiccups made before and during resolution, so that everyone can learn from them and adjust.
- Root Cause Analysis
- Lessons learned