Published papers by Ian Charters FBCI of Continuity Systems Ltd.
You are welcome to reproduce the articles in internal organisational publications provided the source and authorship are clearly identified.
Pulling apart the BIA Continuity Issue 2011/1
What does an embedding programme look like? BCI Conference 2012
A shotgun marriage or amicable friendship? Unpublished 2010
Putting the Business back in Business Continuity Continuity Central 2007
Do we have a Risk Appetite or Planning Blight? Continuity Central 2008
Resilience - the 'missing' element in BCM Unpublished 2012
A critical question : Is your job really necessary in a crisis? Continuity Volume 7, Issue 1
Stuck in the tunnel! Blueprint (the Journal of the Emergency Planning Society) March 2003
How has 9/11 affected Business Continuity thinking and outlook? Business Continuity Management Forum 2002
The reality of Worst Case Scenarios : Continuity (the Journal of the Business Continuity Institute) Volume 4, Issue 4
Is Business Continuity relevant to Emergency Planning? : Blueprint (the Journal of the Emergency Planning Society) June 2000
Justifying the Business Continuity Project : Continuity Volume 2, Issue 1
Risk Evaluation and Control : The Definitive Handbook of Business Continuity Management ed. A.Hiles and P.Barnes (Wiley)
Is risk management relevant to the BC Manager? : Continuity Volume 2, Issue 4
Making a success of BCM - The role of the Independent Consultant : Facilities Management Today, March 2000.
The intention of this article is to give you some idea of what an embedding programme could look like in the brave new world of ISO 22301 – but it should be just as relevant if you have no intention of certification, as it is still a vital part of the overall BCM programme.
An entire chapter of BS 25999-1 and a significant section of BS 25999-2 are entitled ‘Embedding BCM into the culture of the organisation’. However, this title does not appear in the new International BCM Standard (ISO 22301) – so do we no longer have to embed our BCM practices – or is there no longer a culture to embed into?
If you look more carefully in ISO 22301 there are headings such as ‘Competence’, ‘Awareness’, ‘Communication’ and the various requirements to conduct maintenance at regular intervals. You will also find a requirement to ‘ensure the integration of BCMS requirements into business processes’. So all the elements of embedding are there! The main reason why the requirements have become dispersed was the obligation to use the new standard ISO Management Systems structure.
In addition the requirement to ‘improve the BCMS’ implies a need to create an environment in which BCM can thrive and become ‘the way we do things around here’.
To create this BCM-friendly environment we will have to:
- create support by changing attitudes and behaviours
- improve capabilities by developing skills
- ensure plans, strategies and other BCM elements stay up to date
whilst making best use of our limited BCM resources of time and budget.
We may need to change attitudes to BCM within the organisation so that plans, maintenance and exercises are done willingly and with understanding of their purpose. We can often create awareness most effectively by appealing to the personal goals of individual staff to gain the necessary buy-in, so we will need to vary the message to gain the interest of different groups, for example:
- Top management – we should relate our BCM message to the organisation’s strategic objectives, future financial results and increased certainty of delivering on them
- Middle management – we should tie our message to the individual’s performance evaluation where this requires involvement in response and recovery planning and point out the peace of mind they will gain from being confident that disruptions can be managed effectively
- Sales and marketing – we should get them to see that BCM can create new market opportunities through being able to offer greater resilience to customers, possibly with higher returns (and increasing their own bonus)
- All staff – we should sell it to staff as a protector of their well-being and their jobs
Some ways we can get the message over to all staff are:
- Include the BCM messages – such as incident response procedures, bad weather policies and the importance of maintaining contact information – as part of induction or on-going training
- Encourage them to apply business continuity principles to their home and family – perhaps by putting together a home ‘grab bag’ in case they are evacuated and taking copies of vital documents and spare keys to a safe location (a relative’s house perhaps)
- Timetable a focus during Business Continuity Awareness week – and take advantage of the materials available on the BCI website
We can learn a lot from marketing here. We should view the organisation as a set of customers and find out what makes them want to buy, then make it interesting and attractive. Helen Sweet MBCI used to publish topical doggerel verse to catch the eye – I remember the Christmas one about Santa having his sleigh clamped by a traffic warden – with the hope that he had a recovery plan as the festivities were approaching fast. A couple of years ago the Christmas meeting of the BCI NE Forum played a team game on the same subject – Santa’s little helpers had to manage the continuity of present distribution with their chosen recovery strategies – with Lapland news broadcasts delivering the bad news of Arctic Skuas bringing down power lines, reindeer flu interrupting distribution and wrapping paper shortages. This sort of event could go down better at a team briefing than a straight BCM presentation.
We need to develop competences throughout the organisation, both in BCM programme development and in response and recovery. This may include:
- Training – ensuring that individuals and teams have the required skills for specific BCM programme or response roles such as undertaking a BIA, decision making, salvage techniques or effective chairing of a team
- Experience – we should ensure they can apply the knowledge and skills acquired. For example, media training could be practised in the context of an incident management exercise
- Education – to extend the BCM team’s knowledge via forums, diplomas, online learning or conferences so they stay current with developments in the discipline
We also need to try to integrate BCM maintenance into the processes of the organisation – so they are done as a matter of routine. These could be:
- Scheduling a strategic BIA discussion as part of the Executive’s discussion of the organisation’s strategic plan
- Setting a policy that a BIA, business continuity strategy, planning and exercising form an integral part of the development and testing of the launch of a new product or service
- Ensuring that a review of the business continuity impacts is included in change control procedures for IT or production upgrades
- Checking that business continuity procedures become part of the department’s procedure documentation and are maintained at the same time
- A BCM exercise being scheduled as a follow-on from a regular fire-evacuation exercise
- Resource planning – we need to ensure the BCM resources are available to undertake the various events
- Date availability – we need to ensure the availability of staff and required resources such as rooms and equipment
- Interrelationships between all these activities – we need to ensure that skills training precedes events in which those skills can be practised
- Interrelationships with other programmes and training – we may take advantage of programmes run by other parts of the organisation such as induction, other related training and staff events to put over a BCM message
- Integration with maintenance and review cycles already within the business
So an ‘Embedding programme’ - an annual calendar and project plan - is suggested that pulls together all these elements and enables the deadlines and resource requirements to be identified and managed.
Despite this plan, we should always be prepared to take advantage of opportunities. We sometimes throw the planned programme away because something has happened that provides an opportunity for increased awareness – in the way that the threat of a pandemic caused many organisations to develop contingency plans for staff loss.
Having delivered our programme how do we measure its outcomes? Of course, we will need to decide what we are going to measure and establish a baseline to measure against at the beginning of each programme period.
We can measure individuals’ competence through:
- Appropriate training and exercises attended
- Records of personal development such as the BCI’s CPD
- In-house testing or assessment of experience
- Appraisal against job description and objectives
- External qualifications earned
- The evaluation of exercise outcomes or response to real incidents
- The currency of response and recovery plans
- Surveys of staff – to assess awareness and knowledge of the BCM policy and their role in incident response
- Achievements against the business continuity objectives set by management – such as the completion of programme milestones
It should go without saying, though there are requirements in ISO 22301, that the competencies of staff and other measures should be appropriately documented.
I would commend to you the new version of the GPG coming out in March 2013 – BC awareness week – which has been expanded and the embedding section reorganised along the lines I have been following in this article.
So I suggest you should develop an annual programme for embedding BCM, coordinated with the exercise programme – as this provides a useful way to structure and coordinate many of the BCM programme management activities and to ensure that they are delivered in the most effective way.
We should also attempt to measure the effectiveness of our programme of events whilst always remembering that not everything worthwhile can be measured.
There has always been a disconnect between the design of a BCM recovery plan - to return the organisation to normal operation - and the observation that organisations that suffer a major disruption may recover but are changed by the experience, however closely they followed the plan. The classic study by Knight and Pretty (ref) identified 'non-Recoverers' and 'Recoverers' as two distinct groups - the latter showing long-term improvement in shareholder value after the incident.
It is suggested here that this improvement, while initially reflecting good management of the incident, may reflect in the longer term the ability of the organisation to adapt to the new situation in which it finds itself. This could involve a combination of factors such as the re-focussing on business objectives, the ability to handle the opportunities presented by the publicity they 'enjoyed' and the exploitation of the enthusiasms and skills of staff developed during the incident. Much of this impetus would be lost were a 'return to normal' to be applied literally.
This suggests that our recovery planning should identify an extra step - perhaps an adaption phase that, once the recovery has stabilised the situation, encourages the organisation to take advantage of the opportunities that have been presented by the incident. This may involve permanent relocation, expansion of production, changes in management structure, enforcement of a changed regulatory framework or identification of new market opportunities. The capability to undertake this adaption to the new environment neatly matches the biological definition of resilience.
The 'prevention' approach to resilience - the attempt to integrate the identification, analysis and control of risks from across the organisation into a central repository - is likely to be counter-productive since it tends to reinforce the status quo and inhibit change. The one thing that can be predicted with certainty is that change will occur - both planned and unplanned (an incident) - so it makes sense to put in place resilience that will support both types of change.
When a paint factory burnt down a few years ago, the company bought paint from competitors to satisfy their customer orders. It quickly became apparent that customers liked being able to buy all their various paint requirements from a single independent outlet so they became a paint wholesaler instead and were more profitable than before.
As a first draft of what could create this resilience capability to support (and embrace) both planned and enforced change the following requirements are suggested:
- A clear understanding at all levels of the objectives and values of the organisation and a clear customer and market-led focus
- A deep knowledge of the operation and interconnectedness of the organisation's activities so the impact of change can be predicted
- The ready availability of alternative resources and knowledge of a range of options
- A willingness and flexibility of staff through cross-training, multi-skilling, awareness and motivation
- Training and exercising to develop skills of assessment, analysis, solution-finding and leadership
- Open lines of communications and trust between staff and top management that allows for ideas and innovations to be taken on board
- An incident recovery capability that provides the stability and time after a disruption to enable the opportunities to be identified and exploited
If the above list is anywhere near the description of a resilient organisation then it clearly bears little relation to the 'preventative' resilience approach. Instead its closeness of fit to existing good BCM practice should be obvious.
The implementation of a BCM programme should already involve the elements of training and planning so the additional effort to prepare for the resilience phase appears trivial. For maximum benefit it should also be integrated with the (planned) change control and planning activities throughout the organisation. In practice, though, the cultural change required to respond to innovation is likely to be difficult to achieve in the traditional hierarchical management structure and must be led, as always, by example from the top.
In section 8.1 of BS 25999-1 it states that: “The range of threats to be planned for should be determined by the organization’s risk appetite”. Risk appetite is defined in the glossary as “total amount of risk that an organization is prepared to accept, tolerate or be exposed to at any point in time”. It looks innocuous but how do we interpret this statement to set the scope and content of the Business Continuity plan? Do we write a plan around predictable and likely threats? Or do we ‘weigh’ the risks and decide no BC plans are required but still claim conformance with the standard?
Perhaps this statement should be in the previous section (determining business continuity strategy) since it is a recommendation about the range of threats which the Continuity Strategy (rather than just the plans) should address. What is concerning, for those attempting to implement the standard, is that no guidance is forthcoming as to how such a ‘total level of risk’ is to be measured, which is surely a prerequisite of a decision as to whether to accept exposure to it.
This article suggests a possible interpretation for this ‘range of threats’ and how this should determine the scope of strategies and plans. We make certain assumptions when developing BC strategy and plans but these are often not documented and may therefore be misunderstood outside the BCM team. What does become quickly apparent is that we lack suitable terminology that succinctly expresses these ideas to those around us - exemplified by the ‘planning for a worst case scenario’ cliché which can be interpreted as anything from a computer failure to nuclear annihilation.
When incidents occur the impact can be measured on a variety of scales including geographical, economic, physical and human impacts. Clearly the appropriate Business Continuity approach to an incident will depend on the scale of the actual event. But if we are hoping to plan a response in advance we do not have the luxury of prior knowledge of the impact. Instead, we should recognise that there are certain thresholds in an increasing intensity and scale of impacts at which the BC response will need to change or may become ineffective.
A localised incident, affecting a single site with minimal injuries to staff, is well within the capability of a Business Continuity programme to provide an effective response, restoring the operation of the organisation within acceptable timescales. Even if the locality is affected, an adequate separation between alternate sites and data storage locations should enable rapid recovery.
However, a more widespread incident at or near a site may have impacts on the surrounding community causing economic problems, environmental damage and even long-term relocation of population. A small business (such as an independent retail outlet) may have been totally dependent on this local market and infrastructure. A Business Continuity recovery strategy is designed to ensure that the business will be recovered to, roughly, the same position it was in before the incident but if the market, labour or infrastructure situation has changed significantly this may not be appropriate and strategic decisions are required.
A serious, damaging event may also result in fatalities or long-term absence. Documentation and cross-training can only go so far; the loss of a significant proportion of the workforce may make it impossible to cover the required roles and run the business as before. Even a multinational organisation may struggle to maintain the service in an area following significant local loss of staff. As well as problems caused by the lack of skills, the cause of the staff loss may result in reputational damage to the ‘brand’, making recovery in that location unviable.
The response of the local or national authorities to an incident should also be considered. Emergency powers may be imposed in response to a major physical incident, or civil strife, invasion or war could occur, and these may fatally hamper the ability of the organisation to invoke its recovery procedures. Equipment may have been requisitioned, transport may be unavailable and personnel may be unable to take the required actions. In this case the appropriate strategy may be one of withdrawal from the area rather than continuity of the business.
If the incident is of an economic nature (such as a banking collapse) then Business Continuity methods and strategies provide no direct responses for these problems, though some of the impacts, such as failure of suppliers, may be mitigated by supply-chain management. In these circumstances a thorough review of the strategic business direction may be required; though one would hope the BC Manager would be asked to contribute their unique view of its operation to this discussion.
The real ‘worst-case’ - a total annihilation of a district, region or the whole world - is a scenario for which business continuity has no answer because there is no business to recover (though civil authorities may still need to retain control). Businesses providing services solely to the Chernobyl region or in the parts of New Orleans left abandoned will never recover since there is no longer a requirement for those services. Likewise in a war-zone an evacuation may become permanent withdrawal.
These are complicated (and somewhat dismal) ideas to get over to senior management when developing BC strategy and we lack the language to express them succinctly. It is relatively easy to express geographical limits in the commercial sector in terms of viability and ‘market areas’ but more difficult for the provision of appropriate services in the public sector. However, few useful terms exist to express the scale and intensity of an incident in terms of disruption or loss of life - and ‘risk appetite’ isn’t one of them.
We cannot develop a rational BC strategy without certain assumptions being made about the practical limits to our planning. This is not a risk decision - assigning probabilities as scientific as roulette for these sorts of events. Instead each organisation must reach its own conclusions about the limits to its ability to recover, which may include commercial, regulatory or reputational obligations. These limits can be explored, for example, in terms of geographical scale, intensity of impacts and staff losses, from which logical decisions can be made on such issues as separation distance, the extent of staff resilience required and the point at which the BCM ‘white flag of surrender’ is raised and top management have to take hard strategic decisions about a new direction for the organisation in a changed environment.
In a previous article I proposed the term ‘Maximum Survivable Incident’ (MSI) as a way of expressing geographical limits, which some people have found useful. However, other terms and metrics are needed to refine our vocabulary in this area and make it understandable to top management so they can make appropriate decisions; otherwise we risk giving the organisation a false sense of security as to the capabilities of our plans.
When an existing strategy is reviewed an organisation may well discover that their planning assumptions have led to their effective MSI being lower than they required. For example alternate locations may be too close or staff skills too concentrated. Improvements in their BCM strategy should then be aimed at increasing their ability to cope with a bigger incident - and we should have metrics which enable us to measure this improvement in a way that can be demonstrated to management.
Section 4.2 (Context) in BS 25999-1 says that BCM Policy should be ‘appropriate to the nature, scale, complexity, geography...’. Part of that context should be to state in the Policy the limitations of a BCM response and to identify at what points the scale of the impact will be in the hands of the civil authorities or require a strategic withdrawal rather than the planned business recovery. However we are hampered partly by a lack of accepted terminology and sometimes an unwillingness to face up to the realistic limitations to our response.
Dominic Hill's recent article 'Business Continuity - are we still missing the point?' questioned the emphasis of BS 25999 on recovery and plans by reference to a dictionary definition of 'continuity'. He appears to argue from the definition of 'unbroken and consistent existence' that correct design and location of a resilient IT system could make it unnecessary to maintain and exercise recovery capabilities.
Having quoted a definition for 'Continuity' one should also do the same for the other word 'Business'. One web definition gives 'a commercial or industrial enterprise and the people who constitute it' - but no definition mentions IT as a necessary or even key requirement of an entity to be a 'business'. That many organisations have a reliance on IT is obviously true but it is equally true they have a reliance on staff, premises, suppliers and many other resources. Therefore it is suggested BCM's objective is to ensure the 'unbroken and consistent existence' (continuity) of the business not necessarily that of the various support services (such as IT) except, perhaps, when the company's business is in providing IT services to third parties.
The standard is very clear that the scope of the BCM programme is focused on the delivery of the products and services of the organisation. This is because the success or failure of the response to a disruption will be judged, not by the organisation itself, but by those to whom those products and services are delivered. So does such delivery have to be 'unbroken'?
In reality, demand for almost all services is irregular and tolerant of some disruption. With the exception of life support systems, air traffic control and emergency control rooms there is a tolerance by customers of service unavailability for a period of time - which may vary from minutes to weeks depending on the service. We do not expect services from many businesses over holidays and week-ends and how many people change their bank immediately every time an ATM stops working or their on-line service is unavailable? It may also be possible for the organisation to provide an acceptable service to customers for a period of time without the use of IT systems or to contract another company to provide the service for the duration - thus possibly dispensing with the need to resume anything for a while.
There is a cost to recovery services but there is also a high cost of installing and maintaining fully resilient systems across multiple locations. Continuously available IT systems will be a significant outlay and a continuous drain on the finances of the organisation and therefore take considerable justification if their non-availability can be tolerated by the business for more than a few hours. Indeed, other organisational resources may actually be more urgently required than internal IT systems and may therefore demand resilience more than IT. This is why the standard lays such stress on 'Understanding the Organisation' before jumping straight to providing 'solutions'.
In addition, resilient systems are, by their nature, complex and, while they may offer higher availability during 'normal' situations, they require more expensive disaster recovery solutions and more expertise and time to resolve problems when they do fail.
More worryingly, the article seems to suggest that a properly designed resilient IT system does not require a recovery strategy or exercising; this is justified with the cliché 'would it not be better to avoid the incident in the first place'. There seems to be a widely held belief that labelling a threat as 'unlikely' stops it happening to the point where it can be ignored. It is true that obvious measures should be taken to reduce risks and disruptions. However, we have to appreciate that because every location and organisation is unique, the statistical information to label a threat as 'unlikely' with any level of certainty does not exist. Therefore, where the outcome could threaten business survival, this uncertainty should be taken into account. As Nassim Nicholas Taleb says, warning about the narrow focus of specialisms in his recent book 'The Black Swan' - "We can't get much better at predicting. But we can get better at realising how bad we are at predicting".
So the aim of BCM is, surely, to enable the organisation itself to remain unbroken and continue to exist during and after a disruption. To determine appropriate resilience and recovery strategies to achieve this requires a deep understanding not only of the organisation and its operation but also the market in which it operates. This should focus primarily on its customers but also its competitors and its other stakeholders, and understand how and over what time period these would react during a disruption. Only then can the recovery timescale of activities and support services be defined to provide the required level of continuity.
I have lost count of the number of articles that urge RM and BCM to kiss and make up. Hopes are expressed, forlorn as it turns out, for the new GPG to embrace the concept of probability and break down the barriers between RM and BCM. What none of the articles ever describes is how, in practice, this marriage is to be achieved. A marriage between people of different beliefs is unlikely to succeed without a fundamental shift in the views of one or other party.
From the methods used by Risk Management one can deduce that practitioners view the world as being under the control of a relatively hostile entity who releases incidents into the environment from a predefined stock of events – some are opportunities to be grasped but most are negatives to be avoided. Though the events are released in a random order, they tend to congregate around an average of impact and probability, with extreme events being rare and therefore mostly ignorable. Reactively, analysis of past events is used to identify things that might go wrong again and appropriate measures are taken against those events. As analysis and data collection improve over time the Risk Manager would anticipate being able to make more and more accurate predictions and therefore improve on the ‘comparative mastery of risk’ that is claimed but which continuing failures challenge.
In many ways this parallels the Newtonian view of the world that appeared to describe everything from the atomic level to the rotation of planets and galaxies with neat equations until ?? upset this ordered view with what became known as chaos theory…..
Do we just need more statistics and knowledge or more experts and risk consultants? In ‘The Black Swan’ Nassim Taleb states ’We can’t get much better at predicting, we can only get better at realising how bad we are at predicting’. He made a lot of money on the stock market by ignoring market intelligence and following random hunches. He describes an experiment in which a young girl beat a stock market expert by choosing investments in an entirely random way.
The BCM sees the world as anarchic, not subject to prediction: a set of random events sometimes conspiring to cause mayhem – and from sources that are changing faster than they can be understood. The good fortune (the failure of a competitor for example) the BCM leaves for others with the requisite skills to take advantage of the situation. To the BCM a small incident is accepted and treated as a learning opportunity. Instead the proactive BCM prepares their organisation through a rehearsed response as the defence against any adverse incident and then looks to adapt to the new reality, taking advantage of the opportunities that disruptions often throw up by using a deep knowledge of the business acquired through the BIA.
Which is right? Both or neither are possible answers. We need to work from a simplified model of the world to survive – reality is far too complex otherwise. The risk model has so many flaws in its assumptions and methodology that it is probably only fit to be used in the context of well understood core business; and even there with care. These flaws were cruelly exposed in the events that led to the financial meltdown – yet the call has gone out for more of the same! Recognising these shortcomings, the latest GPG relegates the assessment of specific threats to the resources required for the most urgent business activities, does not attempt to quantify their probability – and even then insists that a plan is in place in case the ‘preventative measures’ do not work.
The BCM view of business threats is far more robust and comprehensive by not focusing on specific causes.
So what should the relationship be? Rather than a marriage and the constant fierce rows that will result, a friendship based on mutual respect and understanding, with each keeping to its rightful place.
Since this article was written ISO 22301 has been published and refers to 'prioritised' activities - which is absolutely correct!
As BC Practitioners we often refer to the importance of resuming critical functions rapidly after an interruption, with the implicit assumption that the remaining non-critical activities need little attention. The BCI Glossary describes 'Mission Critical Activities' as:
The critical operational and/or business support activities (either provided internally or outsourced) without which the organisation would quickly be unable to achieve its business objective(s) i.e. services and/or products.
The Good Practice Guide points out that 'It is the Mission Critical Activities and their dependencies that enable the achievement of business objectives i.e. services and products. It is upon these activities that BCM must be focused'. This advice seems self-evident and enables the BCM expertise and budget to be targeted where it is apparently most needed.
The term 'critical' has a number of meanings but the one we have presumably borrowed is that from physics as in 'critical mass' which refers to a minimum amount of fissile material required to maintain a chain reaction, rather than the alternative 'rigorously discriminating'. Unfortunately common usage has added the implication of 'extreme importance' to these dictionary definitions and it is this that causes difficulties in using the term 'mission critical activities'.
A Business Impact Analysis should enable us to identify these 'Mission Critical Activities' but when thoroughly undertaken what emerges from the analysis is a complex web of interaction between business functions and organisational objectives. The connections of some functions to business objectives may be subtle but it would be rash to declare them unimportant and not necessary to consider in a resumption plan. (There is a parallel here with the original approach to IT recovery, in which data restoration would be limited to key data sets - but this was rapidly discredited once the complex interrelationships that exist between datasets were mapped). It should come as no surprise that every function and every person in an organisation is 'critical' in some sense after economic constraints have led to years of down-sizing. If there are people doing work which is not important to your organisation, then why are you still employing them?
In trying to determine the urgency of an activity it is tempting to ask the manager 'How critical is your function to the business?' We understand the purpose of the question, but does the interviewee? It could easily be interpreted by them as 'Is your job really necessary (if not then you could be made redundant)' in which case there is every incentive for them to exaggerate the importance of what they do.
The use of 'critical (or key) functions' can also create problems in implementing a continuity management programme. An Alex cartoon (in the Telegraph) showed the pinstriped character walking along the road with a colleague complaining that 'I don't know how I can show my face in the office again.... the indignity....I have just been designated as non-critical by the BC Manager'. By being divisive about there being 'important' and, by implication, 'unimportant' jobs and staff we are potentially creating rifts which an incident may ruthlessly expose. Successfully managing an incident will rely on the co-operation of all staff including those who are apparently 'non-critical' even if their role is just to keep out of the way for a while.
To resolve this problem we must recognise on what criteria we are trying to differentiate functions. As an example, the function of 'actuary' is critical to the success, or otherwise, of a life insurance company (as any Equitable Life pension holder will agree) but the office cleaners are, surely, not important. However, the timing of the implementation of actuarial decisions is unimportant on a scale of weeks or months whereas an uncleaned office could become a health and safety hazard in days. So the criterion on which we are differentiating functions is actually their urgency, not their perceived importance or status. Some quite low-status functions, such as sorting the mail, need to be resumed urgently whereas some high-status functions, such as strategic planning, can wait for a while until their continued absence makes them urgent too.
The BIA will usually identify a continuum of resumption requirements over time across all business functions, not a split into critical and non-critical. The strategy developed from the BIA will typically require a small team to resume the most urgent functions, then staff numbers will need to increase as further functions are added, usually forming an 'S' shaped curve (as shown in the graph) with a tail made up of the least-urgent strategic functions. The challenge is then to match the provision of building and equipment resources to these growing requirements over time until all functions are resumed. By only considering the requirements of supposed 'mission critical functions' an effective limit is placed on the length of interruption for which the strategy is appropriate and there remains the possibility that a low-status but urgent task has been overlooked. It may not be possible to acquire suitable space quickly enough to accommodate those undertaking tasks which, as time has elapsed since the incident, have now become vital.
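The build-up of staff over time can be sketched in a few lines of Python. This is only an illustration: the activity names, resumption days and staff numbers below are invented for the example and are not taken from any real BIA, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    resume_by_day: int  # days after the incident by which the activity must be running
    staff: int          # people who need a workplace once it has resumed


def staff_required(activities, day):
    """Total staff needing a workplace on a given day after the incident."""
    return sum(a.staff for a in activities if a.resume_by_day <= day)


def build_up(activities, days):
    """Cumulative staffing requirement for each day in `days`."""
    return [(day, staff_required(activities, day)) for day in days]


# Hypothetical BIA output: urgency, not status, decides the order.
ACTIVITIES = [
    Activity("Incident management", 0, 5),
    Activity("Sorting the mail", 1, 3),
    Activity("Customer enquiries", 2, 20),
    Activity("Payments", 5, 40),
    Activity("Office cleaning", 7, 4),
    Activity("Accounts", 14, 60),
    Activity("Marketing", 30, 25),
    Activity("Actuarial review", 60, 6),
    Activity("Strategic planning", 90, 4),
]

if __name__ == "__main__":
    for day, staff in build_up(ACTIVITIES, [0, 1, 2, 5, 7, 14, 30, 60, 90]):
        print(f"day {day:>2}: {staff:>3} staff need a desk")
```

With these invented figures the requirement starts small, climbs steeply in the first fortnight and then tails off as the least urgent functions return. A plan sized only for the first few rows would run out of space long before day 90.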
The simple solution to this difficulty is for us to use terminology which is less ambiguous to those outside the discipline. We could ask how urgent a particular task is and be easily understood. Alternatively if current usage of 'critical function' is too engrained, then could we preface it with the vital qualifier and use 'Time-critical function' instead? Continued use will act as a constant and timely reminder to us, and those we work with, of the critical parameter of our discipline.
A trip to the International Symposium of Business Continuity proved to be an unexpected practical experience of disaster management.
The 12:27 London to Brussels on 17/10/01 entered the tunnel at about 2:00 with the usual announcement that we would be in the tunnel for about 20 minutes. However within a few minutes the brakes came on hard and we came quickly to a halt.
There followed frequent announcements of 'final checks being made' but it was forty minutes before we moved again, very slowly. However within a few minutes we stopped again. A further announcement was made that 'we may have to terminate our mission', a use of terminology which several passengers found very worrying.
The problem had been caused by part of the brake assembly on the front coach which had snapped causing the brakes to come on. The attempt to move the train slowly despite the jammed brakes caused the front coach to fill with eye-watering fumes which prompted the staff to order the evacuation of the front of the train. This was hampered by a man in first class who deliberately blocked the gangway for reasons that could not be discerned.
There was then an announcement that the whole train was to be evacuated with only hand-luggage to be removed (some warning of this would have considerably reduced the chaos when we reached Brussels). The evacuation of all passengers through the rear carriage was handled very professionally as it has, no doubt, been well rehearsed. We were led through a short corridor into the service tunnel which runs in parallel between the two train tunnels. There was a reassuring presence of many emergency service personnel and paramedics though, apparently, there was a fatality due to heart failure.
Around two hundred passengers stood in the service tunnel for about two hours with little information apart from very loud and unintelligible tannoy announcements which were quite frightening. Eventually we were led through into the other train tunnel to board a replacement train to continue the journey to Brussels, now about five hours later than planned.
On arrival at Brussels the fun started as a couple of staff tried to sort out a catalogue of missed connections, lost luggage and lack of accommodation. A few budget travellers spent the night on the benches in the Eurostar terminal. Those at the conference had to shop for clothes in between frequent trips to the station for news. Fortunately our story became a talking point rather than a reason for exclusion from the conference for inappropriate dress. A newspaper reporter had been on the train so there was plenty of coverage of the incident each day. Luggage was finally returned three days later having been shuttled between Calais, Waterloo and Brussels.
Although train breakdowns in the tunnel do happen, apparently this was the first where it was impossible to move the train. As in many such incidents the emergency response was exemplary but the subsequent attempt to 'return to normal' (business continuity) showed the organisation unprepared for this easily predictable situation. Kits of essentials and meal vouchers handed out on arrival would have demonstrated preparedness and control. Also missed was an opportunity to control the press coverage (mobile phones didn't work in the tunnel, much to some passengers' surprise) but the news was spread very quickly once we surfaced. Far from putting me off this mode of travel the experience has increased my confidence in the safety of using the tunnel to reach the continent, but I now make sure that I carry essentials in hand luggage and label all my bags.
Presented at the Business Continuity Management Forum 2002 - The London Chamber of Commerce. 26th April 2002.
The seven months since September 11th have seen the publication of many stories and reams of comment on the impact of those events. I do not intend to go through those events in detail nor do I claim special knowledge of any of the organisations caught up in the disaster. Instead I plan to examine how some of the stories that have emerged from the events in Manhattan could be influencing your continuity planning and strategy, particularly if you are based in a city centre.
Should we work in tall buildings?
Within an hour of the first plane striking the World Trade Centre, I had a call from a Financial Times reporter asking for information on how the risks for tall buildings were assessed. It appeared that the risk of working in tall buildings had suddenly increased. Should all high buildings be immediately and permanently evacuated or would the risk have decreased tomorrow? A number of tall buildings in London were evacuated immediately after the attack as a precaution but were rapidly reoccupied, to be followed by urgent discussions in a number of companies' boardrooms about whether or not to relocate, and I suspect in many staff meeting places too.
Firstly can we determine the risk?
The comment from the FT correspondent highlights the problem of using risk analysis to try to plan for rare but catastrophic events. There have been very few aircraft collisions with tall buildings and, fortunately, few major fires or explosions in skyscrapers so there is little historical record to go on. Any analysis tool which claims to be useful or scientific should give stable and replicable results, but the reliance of risk analysis on historical events and personal perception for an assessment of probability and the impossibility of identifying all threats are fatal weaknesses of this method. It sounds undeniable in theory but breaks down when you try to apply it in practice. The result of the method is dramatic swings from low risk to high risk after single incidents with a gradual falling off until the next incident. While perhaps suitable for determining levels of alert for setting daily policing and surveillance, risk analysis does not provide a stable platform on which to base the longer term decisions such as facility location and IT strategy.
However the perception at the moment is that tall buildings may be dangerous so why do we build them?
Building tall is partly a response to the higher cost of land in the centre of cities, known as the bid-rent curve. Developers maximise their revenues by providing as much space as possible at the highest rents because the land they have bought is expensive. However as the building gets beyond about 50 floors there are diminishing returns as the space taken up by lifts and the complexity of the utilities for the top floors take a disproportionately high proportion of the floor space of the lower floors.
But developers build higher than pure economics justifies - the WTC towers had 110 storeys. The driving force for going higher is prestige - effectively by taking space you are advertising your company's strength and stability to your customers.
So is working in a giant billboard safe? It would seem intuitively obvious that a tall building, due to the lack of opportunity for firefighting and escape, is more dangerous. However risk analysis may tell you that you are probably safer from some of the hazards that afflict those closer to the ground, for example burglars, ram raiders and fires from adjacent industrial processes.
So I think that, barring a spate of copycat strikes, this demand for high rise office locations will continue as long as Boards decide that the intuitive risk evaluation equation comes down heavily in favour of prestige and advertising. Certainly there is no apparent evidence in London of a slowdown, with a new development planned for London Bridge, nor has there been a mass exodus from the City, though a quiet removal of critical equipment may be underway.
So what BC issues are posed by tall buildings?
So given that occupation of tall buildings will continue what are the business continuity issues they pose and how can these be mitigated?
There are a number of fairly obvious resilience-related statements about an organisation that occupies a tall building or is adjacent to tall buildings.
- Because of the capacity of tall buildings, even a small incident could have an impact on a huge number of people and a number of organisations
- Multiple occupancy creates difficulties with evacuation plans and security
- The collapse of a tall building is likely to cause damage to a wide area
- You reduce the resilience of the office space by being in the air - there is no chance to park a generator in the car park or to open a back door as a temporary entrance
- There are added threats of service interruption - lifts, long cables, water leaks and supply failure
These are common sense and should have been fed into the BIA and BC Strategy of those in the WTC.
So what can we learn about these issues from the events of September 11th?
Evacuation and assembly points
There is no doubt that the regular evacuation drills instituted by companies after the 1993 bombing were instrumental in saving the 18,000 people who escaped before the buildings collapsed. Compare this with the six hours that evacuation took after the 1993 explosion. The owners of the building, the Port Authority of New York and New Jersey, had taken a strong lead in encouraging drills and installing emergency lighting. This is in contrast to the indifference with which some landlords treat the complex issues of building security in some UK multi-occupancy buildings. Does your building have a rehearsed evacuation plan?
For those that escaped, the initial confusion was exacerbated because most companies had nominated an assembly point for staff that was too close to the building. So once the danger was appreciated staff fled to other parts of Manhattan or went home, which made it impossible (given the loss of communications) to account for those who might have been in the building. In contrast, Morgan Stanley had chosen a site 10 blocks away in Varick Street, about 800m from the towers, which though still just within the area of flying debris appears to have been sufficiently far away on this occasion. Congregating in one place will have enabled the company to quickly identify survivors and possible casualties, give support and devise and rapidly disseminate a response.
The map shows the extent of the damage in concentric rings around the WTC. Within 200m the devastation was total from collapse and fire. Within 400 metres there was structural damage to buildings from flying debris and extensive dust damage to air-conditioning and electrical equipment. Beyond this up to 800 metres there was damage to glazing and roofs from smaller projectiles and dust with power cuts lasting up to five days. Significant quantities of dust reached up to three kilometres from the site.
This detailed map shows the actual buildings and the extent to which they were damaged. Those in blue in the centre have collapsed but there was major damage to a number of the key financial buildings around the World Trade Centre - and also to the Police HQ (bottom right) and the NY Telephone Building (top left).
This map shows the same concentric rings of damage on the same scale but centred on Bank station. If your offices are in the square mile then try to place your staff assembly point on the map. Will it be accessible if a similar event happened in London?
Locating a suitable assembly point for staff in advance of a major destructive incident is always problematic. If it is a false alarm or a minor fire you don't want the disruption of your workforce crossing half the City on foot. However if it is a major incident the usual assembly point 'across the road' will be overtaken by events - staff will be dispersed by the Emergency Services to what they consider a safe distance, which may be up to 800m for a large explosive device, and police cordons will prevent access thereafter.
It should also not be where everyone else is planning to go - a survey in the City showed that the Monument was the chosen assembly point of companies with a combined total of 150,000 staff. In the City the cordon area is pre-planned therefore it is possible to nominate two locations - one close and the other outside the cordon if the first is inaccessible with clear instructions to staff.
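As a rough aid to this exercise, the ring distances quoted above can be turned into a simple check on candidate assembly points. The sketch below is illustrative only: the function names are hypothetical, the ring radii and the 800m cordon figure are the ones given in this article, and a real cordon will depend on the incident and on the Emergency Services' judgement.

```python
# Ring radii in metres, taken from the damage description in this article.
DAMAGE_RINGS = [
    (200, "total devastation from collapse and fire"),
    (400, "structural damage and extensive dust damage"),
    (800, "damage to glazing and roofs, power cuts"),
    (3000, "significant quantities of dust"),
]


def damage_zone(distance_m):
    """Describe the damage ring that a point at `distance_m` falls into."""
    if distance_m < 0:
        raise ValueError("distance must not be negative")
    for radius, description in DAMAGE_RINGS:
        if distance_m <= radius:
            return description
    return "outside the mapped damage rings"


def outside_cordon(distance_m, cordon_m=800):
    """True if the point lies beyond an assumed cordon radius (default 800m)."""
    return distance_m > cordon_m


def review_assembly_points(points, cordon_m=800):
    """Return (name, zone, usable in a major incident) for each candidate.

    `points` maps a location name to its distance in metres from your building.
    """
    return [
        (name, damage_zone(d), outside_cordon(d, cordon_m))
        for name, d in points.items()
    ]


if __name__ == "__main__":
    # Hypothetical candidates: one 'across the road', one well outside the cordon.
    candidates = {"Across the road": 50, "Secondary site": 1200}
    for name, zone, usable in review_assembly_points(candidates):
        print(f"{name}: {zone}; usable in a major incident: {usable}")
```

The point of the two-tier result is the one made above: the near location serves for false alarms and minor fires, while only a location beyond the cordon can be relied on in a major incident.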
Given the English weather the chosen location should, ideally, be under cover and offer refreshment facilities - quite a tall order but such places do exist.
Emergency Control centres
Similar criteria of distance must apply to your ECC or incident room. This is a facility from which you manage the disaster if your main site is inaccessible. It may need to be no more than a single room that contains contact information and communications equipment but it is obviously important that it remains unaffected by the incident that has caused it to be brought into use.
In the case of Manhattan the Emergency Services' control centre was at the base of the building, protected in a bomb-proof bunker which had apparently proved itself by surviving the 1993 bombing. With the World Trade Centre being a prestige location, a known target and a key facility, it was putting a great deal of faith in 'bomb-proofness' to use this site. It might reasonably be anticipated that a lower profile, more peripheral location would be more appropriate in a wider range of circumstances than a single bomb. The loss of this control room significantly hampered control in the early stages of the disaster until the back-up location could be brought on line.
Likewise the response of a number of organisations in the twin towers was severely curtailed by the choice of location for their back-up media and facilities, compromising their resilience with a desire for operational convenience. They located facilities either on a different floor of their tower, in the other tower or in other buildings close by. None of these would qualify under any business continuity definition as 'off-site'.
So where do you locate your emergency control centre? It is not just damage that has to be considered. Any incident is bound to create gridlock and problems for public transport in the vicinity so half an hour's walk might be a good guide - about 1.5 miles.
One should also consider the location of other facilities and resources required post-incident. One company in the twin towers had to wait two days for access to their back-up tapes because their data storage company were too close to the incident site so had their own problems to sort out before they were able to assist others. There is an ever-present tussle between the convenience of day to day operations and the survivability of the company in an extreme event.
A number of communication issues were exposed by the damage in Manhattan. In the hours following the attack almost all fixed line and mobile networks in the vicinity of the WTC were inoperable. One large switching centre was destroyed and one heavily damaged. This then exposed a number of choke points which were unable to cope with the extra traffic forced through them. The cellular system, always the obvious emergency fall-back in BC plans, was severely overloaded both because of the number of calls and the damage to mobile masts and repeaters in the area. Sprint helpfully led two 'COWs' into the area to replace damaged mobile masts but there were continuing problems. When I first read this I hadn't come across this acronym before and I couldn't see how two bovines were going to assist - until it was explained they were mobile 'Cells on Wheels'.
Of course there was wider disruption to businesses due to the power failures which knocked out telecoms equipment. Many key items of equipment will have been protected against power spikes and short outages. However Uninterruptible Power Supplies provide protection only for a limited time - until their batteries have discharged - and generators only continue while they have sufficient fuel. Air conditioning units, which may be vital to provide a safe working temperature for equipment, are often not connected to the protected supply so their failure causes the equipment to overheat and fail. In Manhattan many generators and air conditioning systems were put out of action by dust and debris blocking air intakes and jamming cooling fans.
These communication difficulties have led to discussion about the need for the US financial services industry to have a dedicated, bomb-proof network. In my experience there is still much that could be done, relatively easily, to ensure that existing telecoms systems are installed and maintained with higher resilience standards before further expensive systems are considered.
The loss of communications caused huge difficulties in accounting for staff. As already shown, the nomination of an assembly point can reduce this reliance but further resilience can be gained by giving and rehearsing standing instructions to staff about what to do in an evacuation, if they hear police advice not to travel or on reaching a cordon on the way to work.
The internet, whose origins were in providing a decentralised system that could withstand nuclear attack, might be expected to have provided the ultimate in resilient networks in this sort of crisis. Yet a massive surge in Internet usage around the world overloaded servers and routers and found the bottlenecks created by damage to the network. As a result most people went back to the TV for information.
Those directly affected by the telecoms outages saw their websites (where these were at a single location) out of service for days after the disaster, resulting in major losses for e-commerce related businesses in the area. Many of the rules of diversity of equipment and network providers have been neglected in the rush to e-commerce.
In the first few weeks an intense spotlight was on the performance of third-party recovery services. IBM supported over 100 clients, Comdisco 46 and SunGard 30. None lost any recovery facilities in the incident. On the whole the feedback has been positive and all client invocations were 'facilitated'. However behind the press statements lie a number of concerns:
- Comdisco is utilizing 13 of its U.S. recovery centers
- SunGard is currently supporting subscribers at five facilities along the eastern seaboard and one in Chicago
Each of those clients will have expected a full service and planned to be using the recovery centre nearest to the World Trade Centre. A DR manager who called eight minutes after the first plane struck was told he was 11th on the list. Yet these people have signed contracts that stipulate an allocation of resources on a 'first-come, first-served' basis. Certainly the disruption for relocated staff will have been significantly more than anticipated. What happened to the exclusion zones that are supposed to protect clients against multiple invocations?
The other contract clause that many will have signed relates to guaranteed occupancy of the recovery facility for a maximum of eight weeks. This term is a historical hangover from the time when contracts covered IT equipment only. After that, if another client invokes, you have four hours to vacate the facility - in other words you ought to have moved into other premises before that deadline. In discussion with one of the suppliers last summer I was assured that no disaster goes beyond eight weeks and that that is plenty of time to sort any problem out. While it may be true of most IT equipment, it's not a long time to find new premises and equip them as well as recover business processes. It will be interesting to see whether the experience of this event forces recovery companies or their customers to review that stance.
Paradoxically one recovery supplier, Sema, which in contrast operates an 'equitable share' policy for concurrent invocations and a maximum six-month tenure, received no confirmed invocations, though it was put on standby by several customers. With the agreement of its own clients, it offered its facilities to clients of the other recovery service companies.
The other area where companies slipped up in New York is in seeking to provide alternative facilities for IT staff or those identified as 'critical staff' only. Not only is this somewhat divisive (as illustrated in the cartoon seen a few years ago in the Daily Telegraph's Alex strip), it ignores the fact that ALL functions in an organisation are critical - else why are you doing them? The missing dimension here is time. Everyone in the organisation is critical, but some are more time-critical than others. I do feel this is a key mistake in BC planning. A manager asked if he is 'critical' will immediately go on the defensive and try to justify his importance; asked instead if his role is time-critical he is likely to give a much more considered and realistic answer. As an example take the skill of an actuary in a pensions company. His experience is absolutely critical to the success of that company but if that function is not available for a few weeks the impact will be minimal. However, ignore it for a few months or more and the effect on the company could be catastrophic.
So the result of the loss of the buildings in New York and the severe underestimate of space required was a desperate search for alternative premises, and hotel accommodation was snapped up rapidly. About 16 million square feet of offices (about 20% of the total) were destroyed and another 12 million were in the cordoned area. The economic downturn meant that a similar area of office space was available in Manhattan; unfortunately it was mostly suitable for small to mid-size firms, and could not accommodate the trading floors, newsrooms and other large open spaces needed by major companies.
The practical advice to glean from this is to have thought about the build up of staff from day one onwards for up to three months - or however long you think it will take to find, buy, equip and network a suitable replacement office. Even if you cannot afford to lease the required space and keep it empty, at least you will have a head start in knowing what you require even if other companies are chasing the same facilities.
In New York the Regulatory Authorities are considering whether to designate certain areas of the city for backup sites in an attempt to get companies to return to Manhattan yet to reduce overall exposure... and this from the land of free enterprise. Tim O'Brien from the FSA actually raised the concentration of recovery centres in Docklands as a specific risk to City companies.
There have been few public details about the effect on computer systems though one recovery supplier said it was using just about every platform available. What did occur even in the UK was panic buying of equipment. What price recovery plans based on purchasing equipment post-disaster through normal channels? Anything that disrupts production or distribution is likely to make things worse.
Accounts of technical recovery problems have also been understandably few but there have been general comments that shortcomings in many companies' back-up regimes have been highlighted. Either data was found to be missing or corrupt or there were synchronisation problems - that is, the back-ups had been taken at different times on different systems so there were incompatibilities between the restored files. These stories suggest that there has not been enough invested in installing the required quality of back-up regime (such as offsite mirroring) and in end-user testing to identify these problems.
The Achilles heel of many recoveries is the loss of paperwork. Pictures from previous events have shown papers strewn all over the area - confidential documents and work in progress. Marsh's heavy investment in imaging paid off with little more than that day's work lost.
The Cantor story is one of the saddest corporate tragedies to come from this attack. They reportedly lost 70% of their workforce. It is difficult to envisage any plan that can address a loss on that scale. Another company lost nearly all their worldwide business continuity staff who were at a conference that day.
However there are many companies who could be crippled by losing one or two key staff with specialist skills or knowledge. Whatever those skills, be it in their contacts, knowledge or a technical area, it makes sense to have a clear programme of multi-skilling and the spreading of information. This has obvious benefits in a crisis where the ideal person for a task may not be available. On a training course I ran, one delegate related the anecdote that his CEO was heard to threaten that if any member of staff made themselves indispensable then they would be fired.
Should we apply the same logic to staff as we do to technology and make sure that there is some duplication and overlap at multiple locations? Perhaps the aim should be to create an organisation that is geographically and functionally diverse enough that the loss of one location cannot cripple the business. In the current economic climate this may be difficult to argue but it is the logical outcome of the requirement to create resilience. Maybe the 'Distributed Headquarters' is the structure for the future but perhaps communications (both data and transport) will have to become more reliable before that can be realised.
All of the reports of organisations recovering highlight the crucial part played by their staff and suppliers - causing 'miracles' to be accomplished. Contrast this with the general lack of support for recovery activities and personnel from management level in companies prior to the attack - and, I suspect, in most organisations again now that things have apparently returned to 'normal'.
How many recovery plans depend on the ability to move staff or computer back-ups rapidly to a remote location? In the US that usually means an internal flight and I know of at least one UK company where internal flights are a key element of the plan. Movement of people away from Manhattan was severely hampered: the subway was disrupted, the road system was gridlocked, bridges were closed and all flights were grounded. The situation was made worse by rumours, hoax calls and general unease. Recovery plans that rely heavily on rapid relocation are always questionable because there are so many factors outside your control.
The initial grounding of aircraft in the US and the subsequent hurried implementation of security measures led to severe disruption of air freight. Cargoes were held up for long periods while equipment was installed and staff were trained to meet the hurriedly-drafted new rules. Ford, reliant on single sourced JIT supply, was forced to close some production lines due to lack of parts. There were many other significant delays to deliveries across the world - in November an African country was running dangerously low on phone cards due to delays with scanning equipment at Heathrow.
There is still a reluctance to fly in the US which I find illogical - it's the problem with risk analysis again. Americans used to board aircraft with the same level of ground security as boarding a bus. Yet my town's Easter youth music festival has been cancelled because the US school bands still won't fly due to the perceived danger. This is despite the improvements in security which now approach what we were used to in Europe. Still, there have been some environmental benefits of reduced CO2 emissions from fewer planes! The railways in the US have had a sudden resurgence of passenger numbers - but their infrastructure is even worse than that in the UK - there have been two fatal crashes recently.
As a result of travelling difficulties there has been an upsurge in video conferencing & telecomuting which has been a lifeline to stuggling telecoms companies. Home working is often pointed to as a neglected area of BC strategy but there are significant logistical and managerial problems. It is only really a suitable recovery strategy if it also a business strategy. The problems need to be ironed out in normal working conditions if there is to be any chance that it will work in a crisis.
There are increasing pressures on all business, but particularly the financial sector to prove their resilience. Turnbull is usually quoted as the driving force here but by focusing on all risks to an organisation there is a danger that catastrophic threats become discarded into the 'very unlikely' category in the risk analysis and thus ignored because of the serious implications of implementing appropriate counter-strategies.
There may also be pressures from staff, expressed through resignations or recruitment difficulties, concerned about their personal safety when working in high buildings. After the Docklands Bomb two thirds of the employees of one company whose building was seriously damaged resigned on the grounds that they no longer felt safe in Docklands and blamed the company's managers for not protecting them. A survey by globalcontinuity.com shortly after the tragedy showed 43% respondence strongly against working in high-rise offices, though as memories fade this will have already dropped significantly. Yet who is going to be the first person to sue their company for the stress caused by making them work in a tall building?
So many financial and management pressures conspire to lead businesses to consolidate and agglomerate, it may require significant pressures to encourage diversification - perhaps this is an area where insurance companies could take a lead in looking at underwriting policies that discourage undue concentration of facilities in a single location.
How big a bang?
The explosion at Canary Wharf did severe damage within about 200m, the Manchester bomb about 400m and severe damage stretched 800m in Manhattan - and now the threat of a nuclear or biological device? So how big will the next one be - what is our worst case scenario, or are we wasting money and effort on trying to be bomb-proof?
To keep this in proportion I think you need to consider how, in an extremely serious incident, the Emergency Services, your staff and other companies will react. Work and possible relocation comes well down the priorities when human life or family welfare is at stake. A major incident may lead to the intervention of the military and the taking of emergency powers. A catastrophe may destroy or alter your market. For these reasons there is a sensible limit to the size of incident for which it is worth planning, what I term a 'Maximum Survivable Incident'. Like the advice given if you stumble across a number of bears while walking in the woods, it might be best to act 'dead' for a while until the market is ready to resume.
Those companies who had such excellent plans that they were ready to trade on 12th September on the NY Stock market might have been rather annoyed that the Stock market was closed for several days. This is an area where agreement between regulators, agencies and financial services companies on the conditions under which trading would be suspended could save substantial investment in facilities that could be unnecessarily resilient.
I have concentrated, as expected, on the lessons from the fallout of September 11th but I want to end with a caution against allowing the focus of continuity to concentrate on bombs and explosions. We must not forget that London's environmental foot (boot?) print is huge; it depends on services and people from a considerable area. A major utility disruption such as a contaminated water supply, or a prolonged power or transport failure, could cause a much more widespread and disruptive incident than a single explosion. As well as a terrorist surprise, man-made and natural disasters have plenty of potential to cause us challenges in the near future. The only prediction I am prepared to make is that the next major incident, when it comes, will be unexpected but the companies with flexible, rehearsed plans will be able to cope with whatever the incident throws up.
I set out to analyse how September 11th had changed BC thinking and outlook. In illustrating what lessons can be learned from the incident and subsequent recovery efforts, I conclude from the many recovery successes that the core principles of BC, such as resilience, duplication, dispersion, planning and testing, held up well, though the reminder of the crucial role of staff was timely. The problems came where the rigorous implementation of these principles was compromised by an unsound risk assessment, demands for day-to-day operational efficiencies and a neglect of thorough user testing.
I hope I have pointed out some significant lessons from the information presented that you can take away with you to enhance your own Business Continuity planning.
Thank you for listening. I am happy to respond to any questions or comments.
© Continuity Systems Ltd
What do we really mean by a Worst Case Scenario?
'Worst case scenario' is a term which is regularly used as a convenient shorthand by BC practitioners to describe the severest incident covered by the continuity plans they develop. However with an apparent increase in the severity of extreme weather conditions, and the ever-present threats of tectonic and man-made catastrophes, this term may need to be used more carefully or even discarded. We should consider whether its use could give an organisation a misleading impression of the scope of the incident for which their continuity plan provides protection.
When assessed on a scale that starts with obliteration of the earth by an asteroid, through global pollution by radiation, a massive earthquake event to widespread flooding, the isolated loss of an organisation's head office by fire or flood is a smallish disaster which can scarcely be called 'worst case'. Its impact, while serious for the organisation, is at the local rather than regional level.
We make similar assumptions about loss of life. When faced with the difficulty of envisaging a catastrophic incident, we tend to use the 'denial of access' scenario and call it a 'worst case'. But if there are many staff casualties it may not be possible to recover the organisation successfully; an attempt at a rapid recovery may even look heartless.
Most BC plans are designed to cope with a local incident by providing existing staff with alternative facilities within daily travelling distance of the main site. The Ice Storm in Quebec prompted a reappraisal of assumptions about the optimum distance of a recovery centre from a site as the weather paralysed a region 300 miles in diameter for six weeks. This raises the question of whether an organisation should maintain expensive continuity plans that can cope with a major regional or even national emergency or select a less-than-worst-case scenario and hope that a more serious calamity will not happen.
One approach would be to decide to exclude catastrophic incidents on the grounds that their probability of occurrence is so low as to make them not worth considering. This risk analysis approach is unsatisfactory both because we cannot accurately measure the probability of rare events and because their rarity is of little comfort if one does actually happen. Extensive power cuts hit Scotland twice in one week after Christmas 1998 due to storms described each time as 'once in thirty years' and there is no historical precedent for the recent flooding in SE England.
A more pragmatic approach is to examine the experience of organisations faced with regional scale disasters such as hurricanes and earthquakes. In this scale of event, an organisation's ability to recover may be hampered by the response of the emergency services, whose priority of public safety may conflict with the organisation's need for staff to get to work. If the incident has already caused heavy casualties or is threatening further destruction, then even key recovery staff may feel that their priorities lie at home rather than at work. In an extreme case, facilities such as generators, communications and buildings set aside by the organisation for recovery use may be commandeered by the public authorities for community priority needs.
It is suggested that the solution is to remember that the aim of the continuity plan is to provide a means by which an organisation's objectives will continue to be met even when an incident threatens to derail them. There is an assumption here that business objectives will remain the same despite the interruption, and thus a predetermined recovery strategy can be adopted. It follows that if the incident is of such magnitude or such wide extent that the current organisational objectives are no longer relevant, then a pre-planned recovery strategy is likely to be inappropriate. A local council officer in Yorkshire illustrated this succinctly when, questioned by a councillor on the plans to collect council rents in the event of a nuclear attack, replied that there were none because no-one would be around to worry about it.
This reasoning can also be applied when considering how far apart buildings have to be before they can be considered to provide resilience for each other. For city centre buildings more than a few hundred metres apart to be destroyed simultaneously (other than by sabotage) would require a major explosion, flood or weather event and could result in many casualties. The emergency response to this magnitude of event could last for several days and would have priorities of safety and welfare which would severely impede any business recovery efforts at either location, however carefully these have been planned. In addition, staff living in the area may be having to cope with damage to their own property. Should this organisation have recovery facilities, or a second head office, at a distance? And if so, then how far distant should they be - in a different region or even another country? And even if it had those facilities, would the organisation's reputation survive the press coverage?
It is suggested that each organisation should consider the extent of its 'Maximum Survivable Incident' (MSI) - the most serious and extensive disaster it wishes to plan for, beyond which no predetermined strategy can be expected to be appropriate. This MSI should then be used in place of any 'worst case scenario' in the Business Impact Assessment, which will then ensure that an appropriate recovery strategy is chosen.
Some organisations, particularly the Emergency Services and local authorities, will necessarily have a large MSI since they have welfare responsibilities to a large area and will need to spread their facilities appropriately to prevent a single incident rendering all their resources inoperable. However at some point on the scale the incident will become serious enough for national and even international resources to be deployed and to replace local efforts.
For commercial organisations, the Business Continuity Strategy should be developed around a Board-agreed Maximum Survivable Incident scenario. If the Board demands an ability to survive a regional or national crisis, even if this involves major casualties, then the organisation's resources must be dispersed accordingly. If it accepts a more limited disaster as its MSI then appropriate continuity plans can be implemented without the additional overheads of trying to provide for major incidents which would take the means of recovery out of the organisation's control and make its business objectives obsolete.
Knowing when to admit defeat in a disaster is vital for an organisation. Expending effort on an attempt at recovery from a catastrophe greater than an organisation's MSI is futile and may mean that opportunities opened up by the incident are overlooked.
Ian Charters is an independent Business Continuity Planner with six years' experience in assisting companies and other organisations to develop appropriate recovery plans. He also presents workshops and seminars for Survive. Ian is a Member of the Business Continuity Institute and a member of the Emergency Planning Society's Business Continuity group.
P&O Stena Line's computer systems that manage the loading of ferries at Dover Docks are highly resilient, being split between two data centres, two miles apart with back-up circuits. However, when a technical fault crashed the system in September 1999, police invoked part of their Operation Stack emergency plan which involved parking all the lorries on the M20. The interruption was significantly prolonged because the temporary lorry park and resulting traffic chaos delayed both technical staff and replacement equipment from reaching either site.
From the perspective of ensuring Business Continuity in an organisation, understanding the plans and powers of the local authority and emergency services could mean the difference between recovery success and business failure in an emergency. Premises owners can be denied access to a building and its environs by the emergency services where there is concern for safety or where evidence of a crime may be destroyed. As a last resort, equipment and facilities can be commandeered, though usually a request for voluntary assistance is the preferred route, as in the 1999 French storms.
One of the vaguest points in most business continuity plans is how the organisation's staff will work with the emergency services to handle an incident and then retrieve control of the site from them.
There is a wealth of practical experience of these issues in the heads of the country's Emergency Planning Officers gained from recent experiences and in many cases augmented by colourful past exploits. Were James Bond ever to retire, he would not find himself out of place in that sector. EPOs may find banana skins lurking around every corner but they also share with the Business Continuity profession a determination to learn from every incident - even if it is dealing with something as unexpected as a dead beached whale. Inviting the local EPO to attend and comment on your company's Business Continuity exercise can be a sobering, even depressing, experience. However, the practical experience they freely offer can only enhance the company's ability to cope if an incident occurs.
Responding to the media scrum can become the Achilles heel of a company's attempt to recover from an incident. Many companies have a press officer or nominate a senior manager to brief the press but few have the training, experience or backup to handle the immediate demand for statements. Many companies will lack the media contacts and be unaware of the correct local procedures for issuing press releases, such as the Scottish Lord Advocate's guidelines. Here again there is a wealth of knowledge on these issues in the public sector where disasters, albeit other people's incidents, are handled more frequently.
The BCI standards, formerly 13 in number, were redefined recently to 10 in consultation with the DRII. Some topics were combined but two new ones were added to reflect the experience of recent incidents: Public Relations and Co-ordination with Public Authorities. This identified the need for companies to improve their recovery planning by collaboration with the public sector services.
The Manchester bomb demonstrated the reliance that businesses place on the local authority in the aftermath of a major incident. The larger companies were able to call on their own resources and plans to look after their staff and their business. The smaller enterprises could only look to the City council to assist with access, salvage, insurance and relocation issues. The prosperity of an area and its tax base depends on its local businesses so councils have a real interest in assisting businesses to survive.
However it should not be seen as a one-sided relationship. The most significant contribution that private companies can offer to the public sector is their site and staff for use in emergency exercises. They also have the money to sponsor these exercises, since the failure of their plans can usually be shown to have serious financial implications and the plans may be a statutory requirement.
Whilst many local authorities have had experience in handling disasters within their community, few have addressed the impact of an incident affecting their own buildings and staff. As local authorities are encouraged to view the running of council services as a 'business', the business continuity expertise developed to identify and protect critical functions in the private sector is valuable. Maintaining an ability to provide an acceptable level of service to customers is both a public and private imperative.
Where hazards are highly visible this mutual interest can lead to a close and on-going co-operation between public and private sectors. Since 1969 the chemical companies that surround the town of Grangemouth have worked in partnership with the local council (now Falkirk Council) and the emergency services. Leaflets telling householders how toxic gas escapes will be notified to them are written and funded by the companies and distributed by the council. Regular training and exercises are conducted with premises provided by the companies. In the event of an incident, the council's emergency control team is supplemented by representatives from each company authorised to contribute their company's resources - for example their private fire engines, a generator or specialist staff - as required. Companies share their experience of an incident candidly with each other so that each can learn lessons from it. With ten major incidents in the last thirty years such co-operation is vital especially with the limited resources at the council's disposal since the fragmentation of local government in the most recent reorganisation.
By making contact with the appropriate local public bodies and working with them, a company can ensure that it understands the responsibilities of the various organisations it must deal with. It can ensure that its particular needs are known and that it knows how to press its case in an incident. The public bodies can develop and rehearse their plans with an understanding of business needs and be in a position to determine the best balance between public and commercial interests in responding to a major incident.
Business Continuity Conferences nearly always feature at least one seminar on the theme of gaining Board 'approval' for the Continuity Project. Budgets are forced out of reluctant directors by describing the impact of a variety of serious but unlikely events on the organisation. However convincing these scenarios, they are always open to being refuted by the observation that business failures are rare and are far more likely to be the result of poor product design, marketing mistakes or financial mismanagement.
Whilst assisting organisations to develop Business Continuity Plans I have observed that organisational changes have resulted from the project beyond those specifically intended. This has convinced me that a more positive approach to the Board, stressing the benefits of the project to the organisation and the 'bottom line', is more likely to be accepted. However, to achieve this the BC planner may need to interpret terms of reference a little more generously than is customary.
A Business Continuity project will work with key personnel and functions within a business to develop a recovery plan. This is not a one-way process of information collection - these key areas will also be changed by the process. Structures and procedures are put in place in the organisation to enable recovery in the event of an interruption. Many of these are of value in normal business operations too and some are highlighted below.
To be effective a recovery plan needs to be communicated to all staff so that everyone works in the same direction during an incident. Few companies seem to have an effective means of disseminating information to their more junior employees but once briefings are set up as a pre-requisite for recovery planning, their use for more general discussions comes naturally. If these new channels are used well this should result in staff being better informed and feeling more involved.
A telling point was made during a recent presentation about the recovery of a Scottish distillery from a flood. The speaker said that the teamwork encouraged by the salvage company assisting the recovery had been so impressive that it continued after normal production was resumed and had significantly improved employee relations and productivity. Why do we need a serious interruption to prove the benefits of working together?
Retention of staff
Whilst most business activities are predictable, organisations need to retain at least a few 'dynamic' staff who can react to the unexpected and create new opportunities. Retaining these is a challenge to which the Business Continuity project can offer at least a partial solution since they are often suitable recruits for one of the recovery teams. Joining a team that dons hard hats and high-vis jackets and simulates exciting scenarios can give sufficient challenge to retain them.
Many junior staff in larger companies would say that they feel under-valued. Making clear to staff that key objectives of the development of a recovery plan are to ensure their safety and preserve their jobs should improve their perception.
Recruitment and induction training costs per employee are substantial. Any measure that reduces staff turnover is valuable.
Many disasters could be prevented at minimal or no cost by timely action. Risks, obvious to a visitor, are ignored daily by staff because it is someone else's responsibility or too much trouble to do something about it. Raising the awareness of all staff to the possible impact on their own job of a serious interruption could lead to them taking action that prevents a disaster.
Manufacturers operating JIT production are beginning to realise their vulnerability to supply interruptions. A Business Continuity plan and the guarantees that it allows you to offer and demonstrate to your customers can be used by the marketing department to increase market share, increase prices or boost reputation. An IT facilities management company is now developing a new sales strategy emphasising its ability to provide a service even if its main facilities are inaccessible.
Stand-by buildings and computer equipment are often significant costs in a recovery plan. Alternative uses, as long as they can be cancelled at short notice, can justify this expenditure. A recovery centre can make an excellent training facility with which staff will then be familiar when required for its emergency function. An in-house stand-by computer facility in another location can be used to save an upgrade when year-end or month-end reporting produces the only peaks that overload the production machine's capacity.
The results of a Business Impact Review contain information of use beyond its original scope. It attempts to quantify how the organisation will be affected by the loss of key functions. The insight this exercise gives into the operation of the organisation and its reliance on various functions should interest all senior management considering organisational changes or major projects.
Stretching the Business Continuity brief to its utmost, the BC planner should have an input into all strategic business decisions. Contingency measures are always easier to build into a new facility than to devise afterwards. Capital investment is more likely to be available and at lower rates if the strategy is demonstrably resilient to external variables.
Plans to consolidate and centralise facilities may reduce resilience and flexibility to respond both to incidents and market opportunities and this can be highlighted by the BC planner at an early stage before decisions are irrevocable. One specialist reinsurance company shelved its plans to relocate all of its functions in one City building following the presentation of a Business Impact Review. Alternatively a planned reorganisation may free equipment or facilities that can resource a contingency plan at minimal cost.
The above examples suggest there are many demonstrable benefits to the organisation from undertaking a Business Continuity project which can be used to win support from the Board, though all fall outside the BC planner's usual remit. Are we prepared to become more involved in normal business strategy and management to support our case for continuity planning?
This paper was written in response to an article entitled 'The ethics of fear in booming continuity sales' by Chris Needham-Bennet (WHERE PUBLISHED?) which defines the role of a BC professional as a risk manager with a little extra scope to their responsibility. His risk manager should have relevant data on probabilities and cost analyses for the industry sector and tackle each risk accordingly. The author expresses concern that business continuity professionals are allowing their enthusiasm to 'lead to a distortion or exaggeration of facts and a reliance on fear in their sales methodology' by putting forward improbable, catastrophic scenarios and are in danger of becoming 'risk folk devils and bogeymen'.
In proposing a strategy for risk management that consists of the identification, prioritisation and mitigation of likely risks he both ignores the weaknesses of this risk management approach and shows a misunderstanding of the objectives of Business Continuity Planning.
The techniques and concepts of risk analysis come from two main disciplines - insurance underwriting and engineering. They both aim to measure risk in terms of an impact multiplied by its probability. By this measure the 'plane landing on the building' scenario is so improbable as to merit no attention.
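The arithmetic behind this measure can be sketched in a few lines of Python. This is only an illustration of the impact-times-probability calculation described above: the `Threat` class, the `rank_by_risk` function and every figure in the list are hypothetical and invented for the example, not drawn from any real assessment.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    """One entry in a hypothetical risk register."""
    name: str
    impact: float       # estimated loss if the event occurs (illustrative figure)
    probability: float  # estimated chance per year (a guess, not a measurement)

    @property
    def risk_score(self) -> float:
        # Classic risk measure: impact multiplied by probability.
        return self.impact * self.probability


def rank_by_risk(threats):
    """Return the threats ordered from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk_score, reverse=True)


# All figures below are made up purely to show the effect of the formula.
threats = [
    Threat("Server room power failure", impact=50_000, probability=0.2),
    Threat("Office fire", impact=2_000_000, probability=0.001),
    Threat("Plane landing on the building", impact=50_000_000, probability=0.000001),
]

ranked = rank_by_risk(threats)
for threat in ranked:
    print(f"{threat.name}: {threat.risk_score:,.0f}")
```

With these invented figures the catastrophic scenario has by far the largest impact yet falls to the bottom of the ranking, which is exactly how such threats come to merit no attention under this measure.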
The insurance underwriter needs to make sure that policy premiums from his portfolio will cover claims with a profit margin. To this end they will use historical statistics on the industries and locations of the insured. They cover the inadequacies of this data by aiming for a spread of risk types and locations and can still get their sums wrong even when aggregating thousands of risks. To seek to apply these aggregated generic historical statistics to the future of a specific site in a specific industry with its unique methods of working cannot be an acceptable methodology.
The engineer takes a small sample of each component of a piece of equipment, tests it to destruction then aggregates the results to calculate a probability of failure of the complete machine. No business can be understood in such mechanistic terms. They are far too complex, change rapidly and are affected by many outside influences. Staff, buildings and procedures are fortunately rarely tested to destruction and therefore estimated failure rates are unavailable.
Given these limitations most risk studies of business interruption derive the probability of the occurrence of each threat from estimates given by respondents to risk questionnaires. The resulting grids, graphs and pie-charts can look impressive but can obscure the fact that the figures are based on guesses however well-informed.
Though many risks are easily identified, analyses of actual disasters show that many result from factors such as human error, a combination of unfortunate circumstances or temporary conditions such as building works. These risks are difficult to identify in advance and impossible to assign realistic probabilities to. The cause of some disasters can even be traced to risk management 'solutions' which have failed or have led to impacts where risk reduction methods led to unexpected failures in other areas. That those in business continuity can always find a pertinent recent disaster example suggests that these threats are not imagined or even unlikely.
The most common causes of business failure are lack of sales and cash flow problems. To this business continuity has no remedy, nor should it since this is the business' core competence. Once the business is on a secure financial footing, however, there is an investment to be protected. The role of the business continuity manager, in my opinion, is to protect the business from adverse events in areas outside that core competence (which is of course different for each industry).
The Board give the BC manager the responsibility for ensuring continuity of core business through any adverse circumstances. This may be required either by statute, pressure from customers or accepted best business practice. The challenge is to devise and implement an appropriate strategy that will allow the business to provide a near-continuous service from its critical functions in a worst-case scenario. To determine the meaning of 'near-continuous', 'critical' and 'worst-case' within the company is the first challenge the manager must face.
One necessary simplification made to make planning possible is to consider just a few generic incident scenarios. The choice of realistic and comprehensive scenarios is a further challenge requiring experience. Listing the wide variety of causes that could lead to that type of situation, and trying to calculate their probabilities, does not assist the development of the continuity plan in any way, though it could point to measures which might increase resilience.
In practice the task of managing risks is often given to the BC manager but it should be clearly understood from the Board which is the principal role. A list of key actions drawn up using risk management techniques may differ markedly in content and priority from that evolved through business continuity methods. This is because one is aimed at reducing exposure and losses from known threats, the other at surviving a worst-case scenario for which it is impossible to predict a cause or to assign a probability. Both lists are valid and defensible and the measures finally implemented will usually involve some compromise between the two. However it should be remembered that a risk reduction measure may work in isolation, but a half-built continuity plan will almost certainly fail.
So are BC professionals just 'risk devils and bogeymen' picking over and relating each new disaster with relish? The enthusiasm with which we analyse disasters is not intended to heighten fear in our clients; rather, it reflects a willingness to learn lessons from the failures (and successes) and to pass this experience on so that others can avoid the same misfortune.
'How do you know that you have created a successful BC plan if the organisation doesn't experience a disaster?' This is a real issue in a business environment which expects success to be measurable. Without a real incident how can one be sure that all the time-critical functions have been identified and that adequate preparations have been made for them?
For the consultant who is invited into a company to 'do' Business Continuity for them because they lack internal resources or expertise, it is an even bigger question. How can the consultant know that they have succeeded? Exercising the plan is vital to assess and improve the readiness of an organisation but is there a level of preparedness at which point the project should be deemed to be finished?
There are attempts to create a common objective standard to assess an organisation's preparedness but this has so far proved elusive. Perhaps a common standard will remain so because there is no such thing as a standard organisation and, for the same reason, it is impossible to develop a successful off-the-shelf recovery plan.
The first stage of any BC project must be a Business Impact Assessment from which an appropriate continuity strategy can be determined. To see this stage as a solely technical and statistical exercise loses much of its value. It presents opportunities for the consultant to raise awareness amongst the staff being interviewed and to start to get 'under the skin' of the organisation. Discussing the experience of past incidents can clarify the inherent responsiveness of the organisation and identify individuals who may be able to take on BC roles. It also provides an opportunity to understand how proposals are presented and decisions made by the organisation. Much of this valuable information will be lost if there is no continuity of personnel into the implementation stage of the project.
The success of the implementation of BCM in an organisation by an external consultant will depend to a large degree on how well he has understood its culture. It is almost necessary to 'go native' to understand, for example, in what form a plan should be documented and how staff may react in a crisis. In this respect one of the small number of independent Business Continuity consultants will have the edge since they don't bring the baggage of their own company's structure with them. They are not answerable to anyone except the client and they can rely on their experience rather than their employer's standard methodology. Being solo they provide a continuity of personnel throughout the project yet through an informal network they can call on additional specialist skills if required. The result is often a more inventive and cost-effective solution which makes best use of the organisation's resources.
Experience shows that the wide range of organisational cultures makes such a flexible approach vital. A financial services organisation will spend precious weeks deliberating on the exact number of desks to contract for, only to find its decision overtaken by other events. A chemicals company reckons it can face any crisis without a plan since the management's engineering background limits its perception of a disaster to a plant explosion. A car components factory has to reuse a ramshackle portakabin as an off-site store to keep costs down. An insurance company is used to paying out on other people's disasters, but reluctant to admit they could have one of their own. Within departments of the same company too there can be a variety of attitudes; we have all met IT departments who consider a mainframe recovery plan sufficient for all eventualities. To respond to these situations requires a detailed knowledge of personalities and procedures within the organisation, combined with the external 'expert' view.
So how, after working in so many complex situations, can the consultant claim success and move on? To hand over a smartly bound and detailed plan is easy but of questionable value. It creates the impression of a job completed, whereas this plan is really only the first step on an endless journey. What he should leave behind is a structure that will ensure that the organisation's ability to maintain business continuity will continue to grow and develop. This can be achieved in many ways but its long-term success will depend on the extent to which the Business Continuity project becomes part of the specific culture of this particular organisation.
Success usually involves the general raising of awareness but also the identification and training of a Business Continuity team, some of whom may wish to develop their knowledge and experience and go on to seek BCI certification. As the team takes shape the consultant, originally in charge of the project, must increasingly take a back seat and empower the new team to assert their independence, even if they make a few mistakes.
The success of a management consultancy is often rather cynically
judged by how long they can attach themselves to an organisation - with considerable
savings on their marketing budget. Some consultancies offering Business Continuity
services appear to want to provide a recovery team on a permanent basis to
an organisation despite the issues this raises of availability, knowledge and responsibility.
However my own personal criterion of success is when the new in-house Business
Continuity Team decides that they now have the confidence to take the project
forward on their own. With a call of 'See you in six months at the next exercise'
I can depart reflecting on a job well done.