Managing Major Incidents

The ITIL® core volume Service Operation is not particularly helpful with regard to Major Incidents. It basically says: “A separate procedure, with shorter timescales must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped onto the overall incident prioritization scheme – such that they will be dealt with through this separate procedure.”

In our recent “Managing Major Incidents” workshops we have had an opportunity to discuss the topic with a good cross-section of IT professionals, to present our thoughts and, perhaps more importantly, to gain valuable feedback as to what represents best practice in the field. What follows is our distillation of that best practice and a corresponding process flow to help support it.

Key Recommendations

1. Be clear what your organisation means by “Major Incident”.
2. Appoint one person (preferably the Service Delivery Manager) to determine the severity of the MI and to invoke the MI process if appropriate.
3. Gather together a “war cabinet” of key people to help ensure that adequate, appropriate resources are made available to speedily resolve the MI.
4. Make certain that any escalation to the business can happen speedily and effectively.
5. Place the Disaster Recovery team on stand-by.
6. Be prepared to de-escalate as a solution emerges.
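
As an illustration of recommendations 1 and 2, the following is a minimal sketch of how an agreed Major Incident definition might be mapped onto an incident prioritisation scheme. The impact/urgency matrix, the field names and the 100-user threshold are all assumptions made for the example; every organisation must agree its own definition. The function only flags candidates: the decision to invoke the MI process stays with the appointed person.

```python
from dataclasses import dataclass

# Hypothetical scale: 1 = high, 2 = medium, 3 = low.
# Keys are (impact, urgency); values are the resulting priority (1 = highest).
PRIORITY_MATRIX = {
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}

# Assumed threshold; not taken from ITIL or from the article.
MAJOR_INCIDENT_USER_THRESHOLD = 100


@dataclass
class Incident:
    summary: str
    impact: int                      # 1..3 on the hypothetical scale above
    urgency: int                     # 1..3 on the hypothetical scale above
    users_affected: int
    business_critical_service: bool


def priority(incident: Incident) -> int:
    """Look up the priority from the impact/urgency matrix."""
    return PRIORITY_MATRIX[(incident.impact, incident.urgency)]


def is_major_incident_candidate(incident: Incident) -> bool:
    """Flag a priority 1 incident that also meets the agreed MI criteria."""
    meets_criteria = (
        incident.business_critical_service
        or incident.users_affected >= MAJOR_INCIDENT_USER_THRESHOLD
    )
    return priority(incident) == 1 and meets_criteria


if __name__ == "__main__":
    outage = Incident("Payments gateway down", 1, 1, 500, True)
    printer = Incident("Single printer fault", 3, 2, 1, False)
    for inc in (outage, printer):
        print(inc.summary, priority(inc), is_major_incident_candidate(inc))
```

Keeping the definition in one place like this makes it easier to review the criteria with the business and to change them without touching the rest of the process.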

More Information

More information, including a practical process flow and narrative that we believe represents industry best practice in this particular area, is available from the Sysop Resource Centre http://www.sysop.co.uk/your-account/downloads.

You will need to log in, and possibly register, on the website to access the downloadable resources area. Once there, you will see the categories of downloadable resources, the first of which is “Articles”. Click VIEW RESOURCES and you will see that the first two articles are the Major Incident Process Flow and the Managing Major Incidents narrative.

Stuart Sawle August 2014
http://www.sysop.co.uk

ITIL® is a registered trademark of AXELOS Limited.

An ITIL Process Conundrum

A student brought me up short on a recent course when we were discussing the distinction between incident management and problem management. We had been talking about the need for Incident Management to resolve the user’s issue quickly and the purpose of Problem Management to identify the root cause and provide a workaround or a permanent solution.

The scenario surrounding this particular conundrum goes like this:

• A user contacts the service desk and complains that a particular document he is trying to print fails when sent to his local departmental printer.
• The service desk asks the user to redirect the print to a different printer (different manufacturer) two floors down.
• This is successful albeit very inconvenient. The service desk agrees with the user that this issue has been resolved and the incident can be closed.
• The service desk agent believes that this issue needs to be investigated further and raises a problem record.

• The problem management team eventually identify that there is a deficiency in the printer firmware and ask the manufacturer to provide a fix.
• Meantime, based on the experience from the incident, a Known Error record is generated containing details of the workaround that was successfully employed by the originating user.
• This is used on several occasions in the following months to resolve further similar incidents albeit with considerable inconvenience to the users concerned.

• Eventually, the printer manufacturer comes up with an updated version of the firmware. This is tested, found to be a valid solution and a change request is raised to roll the new version out to every printer of this make and model.
• No further incidents are raised.

• Some time later the original user, based on prior experience, is directing his printed output to the printer two floors down. A colleague asks him why he is doing this. “Because our departmental printer can’t cope with this particular type of document,” he replies.
• “Well, I don’t have any trouble,” says the colleague, prompting the original user to try the local printer which, of course, works perfectly.

The IT service provider has clearly let down its customers and users. But whose responsibility was it to advise the user base in general, and this user in particular, that the workaround was no longer necessary? What went wrong? How would you change processes to improve the communication flow?
Stuart Sawle
http://www.sysop.co.uk

Many IT service continuity plans are fundamentally flawed

Many IT service continuity plans are fundamentally flawed. Most business managers expect that all IT services will be restored within 48 hours or so of a disaster. Alarmingly, Sysop research indicates that it may actually take six months before all services are returned to normal!

The mismatch between expectation and practical delivery is brought about by a number of incorrect assumptions, including:

  • that non-critical services can be recovered in similar timescales to the “mission critical” services for which detailed ITSC plans have been developed.
  • that all services can be recovered to readily available “commodity hardware”.
  • that suitably-qualified IT personnel will be available to support the recovery in the numbers required for the time required.

But crucially, the most significant factor is the high level of support effort required to sustain the newly recovered services. This support commitment will drastically reduce the resource available to recover the remaining services.

Most IT departments have around 20% of their services defined as “mission critical” in a total population in excess of 50. Some 80% of services will take more than two weeks to recover; 50% will take more than a month; 25% will take more than three months.
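
To put those percentages into service counts, here is a small worked sketch. The portfolio size of 60 services is purely an assumption chosen for easy arithmetic (the text says only “in excess of 50”); the percentages are the ones quoted above.

```python
# Percentages quoted in the text, expressed as whole numbers.
SHARES_PERCENT = {
    "mission_critical": 20,
    "over_two_weeks": 80,
    "over_one_month": 50,
    "over_three_months": 25,
}


def recovery_profile(total_services: int) -> dict[str, int]:
    """Convert the quoted percentages into service counts (rounded down)."""
    return {
        label: total_services * percent // 100
        for label, percent in SHARES_PERCENT.items()
    }


if __name__ == "__main__":
    # 60 is a hypothetical portfolio size, not a figure from the research.
    for label, count in recovery_profile(60).items():
        print(f"{label}: {count}")
```

For the assumed 60 services this gives 12 mission-critical services, 48 services still unrecovered after two weeks, 30 after a month and 15 after three months. In other words, on these figures only about 12 services would be back within the first two weeks.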

IT Services Need to be Available in a Crisis
Experience of major contingencies (i.e. those that affect more than just IT infrastructure) reveals that emergency co-ordination teams need effective IT immediately. As the precise nature and impact of the contingency cannot be predicted, IT specialist resource is needed to provide emergency co-ordination teams with their requirements in an efficient and flexible manner. This activity will always take priority over the recovery of routine IT. As organisations become increasingly IT dependent it becomes even more necessary for routine IT (and the data / information upon which management depend) to be available to manage the crisis.

Building a Disaster Tolerant Infrastructure
By planning strategically it is possible to develop an IT infrastructure capable of maintaining IT service continuity throughout even a major contingency. Modern server clustering and data storage mirroring can ensure the automatic fail-over of every single system within minutes – requiring no resource, intervention or dependency on scarce IT skills. With correct planning a highly-available infrastructure can be implemented with no overall increase in the Total Cost of Ownership.

Getting to Grips with Problems

First of all, thanks to my colleague John Allder for prompting me on the topic of root-cause analysis or more simply put: getting to grips with problems.

The phrase ‘root cause analysis’ is often used in a general sense to describe the activity of identifying the underlying cause of an incident. However, Root Cause Analysis (RCA) is also the name given to a specific technique intended for investigating a series of actions or occurrences that lead to an undesired outcome.

Every major problem should be reviewed to learn lessons for the future:
• What was done correctly
• What was done wrong
• What could be done better in future
• How to prevent recurrence
• Whether there has been any third-party responsibility and whether follow-up actions are required

RCA helps to identify not only what happened and how it happened but also why. Only by understanding why will we be able to devise workable corrective measures. For instance, suppose a network technician disconnects a working router rather than a broken one. A typical investigation might conclude that human error was the cause and recommend better training or that technicians should take more care, but neither of these is likely to prevent future occurrences. RCA assumes that mistakes do not just happen but that they have specific causes, and would ask ‘why?’ In the case of the poor network technician the RCA analyst might ask ‘was the router properly labelled?’, ‘was the technician told which router was faulty?’, ‘is there a recognised procedure for deciding whether a router is working or not?’, ‘did the technician know what that procedure was?’

Root causes have four characteristics:
1. They are specific causes: ‘human error’, for example, is too general.
2. They are causes that can reasonably be identified: RCA must be cost beneficial so the analyst must know when to stop the investigation.
3. They are within the control of the management of the organisation. The analyst is looking for causes that can be addressed by the organisation. Although adverse weather conditions might very well have triggered the incident, we cannot do anything to affect the weather and so that is not an appropriate root cause. We can of course do something about how we are impacted by adverse weather and perhaps our root cause / resolution might lie there.
4. They can be addressed by specific solutions. A vague recommendation such as ‘ensure that technicians follow defined procedures’ is wholly inadequate and probably means that more thought needs to be given to identifying the specific cause.
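
The four characteristics can be used as a simple screening checklist for each candidate cause. The sketch below is illustrative only: the class, field names and example causes are hypothetical, and the yes/no answers remain the analyst's judgement rather than anything the code can work out for itself.

```python
from dataclasses import dataclass


@dataclass
class CandidateCause:
    description: str
    specific: bool                   # 1. a specific cause, not e.g. 'human error'
    reasonably_identifiable: bool    # 2. found at a cost-beneficial depth
    within_management_control: bool  # 3. the organisation can address it
    has_specific_solution: bool      # 4. a concrete fix can be named


def failed_tests(cause: CandidateCause) -> list[str]:
    """Return the names of the characteristics this candidate does not meet."""
    checks = {
        "specific": cause.specific,
        "reasonably_identifiable": cause.reasonably_identifiable,
        "within_management_control": cause.within_management_control,
        "has_specific_solution": cause.has_specific_solution,
    }
    return [name for name, passed in checks.items() if not passed]


if __name__ == "__main__":
    # Example judgements for the router scenario; both are assumptions.
    human_error = CandidateCause("Human error", False, True, True, False)
    unlabelled = CandidateCause("Faulty router was not labelled", True, True, True, True)
    for cause in (human_error, unlabelled):
        print(cause.description, "->", failed_tests(cause) or "acceptable root cause")
```

A candidate that fails any test is a prompt to keep asking ‘why?’ rather than a conclusion.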

RCA is a specific discipline. It follows four distinct phases:

• Data Collection
• Charting
• Root Cause Identification
• The Development of Recommendations

Carried out properly, Root Cause Analysis will ensure that an organisation learns all of the lessons from a major disruption to service and reduces the risk of future failures. It will help staff to identify ways not only of reducing the likelihood of future disruption, but also of limiting the impact of any disruption that does occur.

http://www.sysop.co.uk