Managing Major Incidents

The ITIL® core volume Service Operation is not particularly helpful with regard to Major Incidents. It basically says: “A separate procedure, with shorter timescales must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped onto the overall incident prioritization scheme – such that they will be dealt with through this separate procedure.”

In our recent “Managing Major Incidents” workshops we have had an opportunity to discuss the topic with a good cross-section of IT professionals, to present our thoughts and, perhaps more importantly, to gain valuable feedback as to what represents best practice in the field. What follows is our distillation of that best practice and a corresponding process flow to help support it.

Key Recommendations

1. Be clear what your organisation means by “Major Incident” (MI).
2. Appoint one person (preferably the Service Delivery Manager) to determine the severity of the MI and to invoke the MI process if appropriate.
3. Gather together a “war cabinet” of key people to help ensure that adequate, appropriate resources are made available to speedily resolve the MI.
4. Make certain that any escalation to the business can happen speedily and effectively.
5. Place the Disaster Recovery team on stand-by.
6. Be prepared to de-escalate as a solution emerges.
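
As an illustration of recommendation 1, the sketch below shows one way a definition of “Major Incident” might be mapped onto an impact/urgency prioritisation scheme. The three-level scale, the five priority bands and the rule that only priority 1 counts as major are assumptions made for the example, not part of ITIL or of the process flow described here. In practice the flag would only prompt the person named in recommendation 2, who still makes the final call.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-point scale for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Incident:
    summary: str
    impact: Level   # how much of the business is affected
    urgency: Level  # how quickly the business needs a resolution


def priority(incident: Incident) -> int:
    """Return 1 (highest) to 5 (lowest) from a simple impact/urgency matrix."""
    return 7 - (incident.impact + incident.urgency)


def is_major(incident: Incident) -> bool:
    """Hypothetical agreed definition: only priority 1 is a candidate MI."""
    return priority(incident) == 1


outage = Incident("Payment system unavailable", Level.HIGH, Level.HIGH)
if is_major(outage):
    print(f"Candidate Major Incident: {outage.summary}")
```

Writing the definition down in this form, whatever the thresholds chosen, makes it easier to check that everyone on the service desk applies the same test.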

More Information

More information, including a practical process flow and narrative that we believe represents industry best practice in this particular area, is available from the Sysop Resource Centre.

You will need to log in, and possibly register, on the website to access the downloadable resources area. Once there, you will see the categories of downloadable resources, the first of which is “Articles”. Click VIEW RESOURCES and you will see that the first two articles are the Major Incident Process Flow and the Managing Major Incidents narrative.

Stuart Sawle August 2014

ITIL® is a registered trademark of AXELOS Limited.

An ITIL Process Conundrum

A student brought me up short on a recent course when we were discussing the distinction between incident management and problem management. We had been talking about the need for Incident Management to resolve the user’s issue quickly and the purpose of Problem Management to identify the root cause and provide a workaround or a permanent solution.

The scenario surrounding this particular conundrum goes like this:

• A user contacts the service desk and complains that a particular document he is trying to print fails when sent to his local departmental printer.
• The service desk asks the user to redirect the print to a different printer (different manufacturer) two floors down.
• This is successful albeit very inconvenient. The service desk agrees with the user that this issue has been resolved and the incident can be closed.
• The service desk agent believes that this issue needs to be investigated further and raises a problem record.

• The problem management team eventually identify that there is a deficiency in the printer firmware and ask the manufacturer to provide a fix.
• Meantime, based on the experience from the incident, a Known Error record is generated containing details of the workaround that was successfully employed by the originating user.
• This is used on several occasions in the following months to resolve further similar incidents albeit with considerable inconvenience to the users concerned.

• Eventually, the printer manufacturer comes up with an updated version of the firmware. This is tested, found to be a valid solution and a change request is raised to roll the new version out to every printer of this make and model.
• No further incidents are raised.

• Some time later the original user, based on prior experience, is directing his printed output to the printer two floors down. A colleague asks him why he is doing this. “Because our departmental printer can’t cope with this particular type of document,” he replies.
• “Well, I don’t have any trouble,” says the colleague, prompting the original user to try the local printer which, of course, works perfectly.

The IT service provider has clearly let down its customers and users. But whose responsibility was it to advise the user base in general, and this user in particular, that the workaround was no longer necessary? What went wrong? How would you change processes to improve the communication flow?
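
One possible starting point, offered as a sketch rather than an answer, is to make sure the Known Error record remembers who was given the workaround, so that whoever closes it has a list of people to tell. The class, field names and user identifiers below are hypothetical and do not come from any particular service management tool. The sketch deliberately leaves open the question of which process owns the notification.

```python
from dataclasses import dataclass, field


@dataclass
class KnownError:
    description: str
    workaround: str
    # Users whose incidents were resolved with the workaround
    # (assumes the service desk logs this against the record).
    workaround_users: set[str] = field(default_factory=set)
    retired: bool = False

    def record_workaround_use(self, user: str) -> None:
        """Note that this user was advised to use the workaround."""
        self.workaround_users.add(user)

    def retire(self) -> list[str]:
        """Mark the permanent fix as deployed and return the users to notify."""
        self.retired = True
        return sorted(self.workaround_users)


known_error = KnownError(
    description="Document fails to print on the departmental printer",
    workaround="Redirect the print to the printer two floors down",
)
known_error.record_workaround_use("original.user")
known_error.record_workaround_use("another.user")

# Called once the firmware change has been rolled out and verified.
to_notify = known_error.retire()
for user in to_notify:
    print(f"Tell {user}: the workaround is no longer needed")
```

With something like this in place, the original user would at least appear on a list when the firmware change was closed, though someone would still need to be made responsible for acting on it.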
Stuart Sawle