PhoenixNAP Blog

Disaster Recovery Plan Checklist: 7 Critical Points You CAN'T Overlook

    


The need for a disaster recovery plan is never felt more keenly than in the aftermath of the massive hurricanes that recently ravaged the Gulf and southeastern coasts of the US.

Days-long power outages, physical damage and supply chain breakdowns left thousands of businesses in the dark. Most of them are now facing insurance fights and significant infrastructure rebuilds to get back on track.

These are complex challenges that many will struggle to overcome. Yet the organizations that had disaster recovery and business continuity plans in place now have one less thing to worry about.

Designed to help businesses limit the damage of unexpected outages, a disaster recovery plan is a long-term assurance of business operability. While a disaster of this scale is not an everyday scenario, it is one of those uncontrollable things that can be fatal to business operations.

And it can happen to anyone.

In one form or another, natural hazards and human errors are a constant possibility, and this is why it makes sense to prepare for them. When you add cyber-attacks to the mix, the value of a disaster recovery plan is even greater.

This is especially true when you take into account that the average cost of downtime can go up to $5,600 per minute in mid-sized businesses and up to $11,000 per minute in enterprises.
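
To put those per-minute figures in perspective, here is a minimal Python sketch that multiplies them out for an outage of a given length. The function name and the four-hour example are illustrative assumptions; only the two per-minute rates come from the figures above.

```python
# Upper-end per-minute downtime costs quoted above (USD).
MID_SIZED_PER_MINUTE = 5_600
ENTERPRISE_PER_MINUTE = 11_000


def downtime_cost(minutes, cost_per_minute):
    """Rough loss estimate: outage length times cost per minute."""
    return minutes * cost_per_minute


# Assumed example: a four-hour outage.
outage_minutes = 4 * 60
print(f"${downtime_cost(outage_minutes, MID_SIZED_PER_MINUTE):,.0f}")   # $1,344,000
print(f"${downtime_cost(outage_minutes, ENTERPRISE_PER_MINUTE):,.0f}")  # $2,640,000
```

Real losses depend on the business and the time of day, so treat the result as an order-of-magnitude estimate rather than a forecast.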

With every second of outage counting against your profits, minimizing the impact of downtime is a strategic aim. This is best achieved by preparing your entire infrastructure to resist and stay operational even in the harshest situations.

Why You Need a Disaster Recovery Plan: A 500 Million Pound Case Study

While the probability of a disaster may often seem hypothetical, some recent events confirmed that hazards are a real thing. And costly, too.

Hurricanes Irma and Harvey are some of the most striking examples, but a lot of other things can go wrong in business and cause disruptions. One case in point took place earlier in May, when British Airways suffered a major IT system collapse. The three-day outage left thousands of passengers stranded at airports across the world while the company worked to identify and fix the error and get its systems back online. The entire saga reportedly caused 500 million pounds in damage to the company, while its reputation is still on the line.

When it comes to business disruptions, it doesn’t get more real than that.

The BA case is yet another unfortunate confirmation of the fact that unplanned outages can take place anytime and in any company. Those that have no solid disaster recovery and business continuity plans are bound to suffer extreme financial and reputational losses. This is especially the case with those that have complex and globally dispersed IT infrastructures, where 100% availability is paramount.

Events like these call for a discussion of the disaster recovery best practices that may help companies avoid similar collapses in the future. Below is an overview of the critical items that need to be in the plan.

1. Risk assessment and business impact analysis (BIA)

The best way to fight the enemy is to get to know the enemy.

The same goes for disaster recovery planning, where the first step is to identify possible threats and how likely they are to impact your business. The outcome of this process is a detailed risk analysis with an overview of common threats in the context of your business.

A good way to start with risk assessment is to develop a risk matrix, in which you classify the types of disasters that can occur. The risk matrix is essential for identifying the scope of damage and understanding which events can be devastating for the business.

Figure: Risk management chart (source: Smartsheet)
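
A risk matrix can be as simple as scoring each threat by likelihood and impact. The sketch below is one hypothetical way to do that in Python; the three-level scale, the score thresholds, and the example ratings are all assumptions for illustration, not values taken from any standard.

```python
from dataclasses import dataclass

# Assumed three-level scale for both likelihood and impact.
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Risk:
    name: str
    likelihood: str  # "low", "medium" or "high"
    impact: str      # "low", "medium" or "high"

    @property
    def score(self):
        # Classic matrix scoring: likelihood multiplied by impact.
        return LEVELS[self.likelihood] * LEVELS[self.impact]


def rating(score):
    """Map a score to a band; the thresholds are illustrative."""
    if score >= 6:
        return "critical"
    if score >= 3:
        return "moderate"
    return "minor"


# Example entries with made-up ratings.
risks = [
    Risk("Hurricane or flood", "low", "high"),
    Risk("Human error", "high", "medium"),
    Risk("Cyber-attack", "medium", "high"),
    Risk("Short power cut", "medium", "low"),
]

# List the highest-scoring risks first.
for risk in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{risk.name}: score {risk.score} ({rating(risk.score)})")
```

Sorting by score gives a quick view of which threats deserve attention first in the planning steps that follow.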

After you identify and analyze the risks, you can move on to creating a business impact analysis (BIA). This document should help you understand the actual effects of any unfortunate event that can hit your business. Whether it’s loss of physical access to premises, system collapse or inability to access data, this analysis is the basis for planning the next steps.

To get started with BIA, you can use FEMA's resource with a simple disaster recovery plan template.

2. Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

RTO and RPO are two of the key concepts in disaster recovery planning, whether your data resides in dedicated or virtualized environments. As a reminder, they refer to the following:

  • The maximum acceptable amount of time to restore applications after a disruption (RTO)
  • The maximum amount of data that you can afford to lose, measured as the time between the last usable backup and the disruption (RPO)

More detailed definitions of RTO and RPO can be found here, but their real-life values will vary between companies. Setting RTO and RPO goals should involve a cross-department conversation to best assess business needs in this respect.

The objectives you define this way are the foundation of a disaster recovery plan and they also determine which solutions to deploy. This refers to both hardware and software configurations needed to recover specific workloads.
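
As a rough illustration of how these objectives drive solution choices, the following Python sketch checks a backup schedule and an estimated recovery time against RPO and RTO targets. It assumes RPO is expressed as a time window and that worst-case data loss equals the interval between backups; the function names and the sample values are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval, rpo):
    """Worst-case data loss is one full backup interval (assumption)."""
    return backup_interval <= rpo


def meets_rto(estimated_recovery, rto):
    """The estimated time to restore service must fit within the RTO."""
    return estimated_recovery <= rto


# Hypothetical targets agreed across departments.
rpo = timedelta(hours=4)
rto = timedelta(hours=2)

# Nightly backups cannot satisfy a four-hour RPO.
print(meets_rpo(timedelta(hours=24), rpo))    # False
# Hourly replication can.
print(meets_rpo(timedelta(hours=1), rpo))     # True
# A 90-minute restore fits a two-hour RTO.
print(meets_rto(timedelta(minutes=90), rto))  # True
```

A failed check points to the gap that needs closing: more frequent backups or replication for RPO, faster failover for RTO.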

3. Response strategy guidelines and procedures


Documenting a disaster recovery plan is the only way to ensure that your team will know what to do and where to start when a disaster happens.

Written guidelines and procedures should cover everything from implementing DR solutions and executing recovery activities to infrastructure monitoring and communications. Additionally, all the relevant details about people, contacts, and facilities should be included to make every step of the process clear and simple.

Some of the general process documents and guidelines to develop include:

  • Communication procedures, outlining who is responsible for announcing the disaster and communicating with employees, media or customers about it;
  • Backup procedures, with a list of all facilities or third-party solutions that may be used for document backups;
  • Guidelines for initiating a response strategy (responsible staff members, outline of key activities, contact persons, etc.);
  • Post-disaster activities that should be carried out after critical apps and services are reestablished (contacting customers, vendors, etc.).

The key to developing effective procedures is to include as many details as possible about every activity. The essential ones are a) the name of a responsible person with contact details, b) action items, c) the activity timeline and d) how it should be done. This way you can achieve full transparency for every critical process in the overall disaster recovery plan.
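
One way to keep those details consistent is to record every procedure in the same structure and flag anything left blank. The Python sketch below is only an illustration; the field names, the sample entry, and the contact address are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Procedure:
    owner: str               # name of the responsible person
    contact: str             # phone number or email of the owner
    action_items: List[str]  # what needs to be done
    timeline: str            # when it should be done
    instructions: str        # how it should be done


def missing_fields(procedure):
    """Return the names of fields that were left empty."""
    return [name for name, value in vars(procedure).items() if not value]


# Hypothetical entry with the instructions not yet written.
notify_customers = Procedure(
    owner="Jane Doe",
    contact="jane.doe@example.com",
    action_items=["Post status page update", "Email affected customers"],
    timeline="Within 30 minutes of declaring a disaster",
    instructions="",
)

print(missing_fields(notify_customers))  # ['instructions']
```

Running a check like this over every documented procedure makes incomplete entries visible before a disaster does.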

4. Disaster recovery sites

Putting the plan to work also involves choosing the disaster recovery site where all critical data, applications and physical assets can be moved in case of a disaster. Such a site needs to be able to support active communications, meaning that it should have both critical hardware and software in place.

Traditionally, there are three types of sites that can be used for disaster recovery:

  • A hot site, which is defined as a “functional data center with hardware and software, personnel and customer data”;
  • A warm site, which allows access to all critical applications but does not hold customer data;
  • A cold site, where you can store IT systems and data, but which has no technology in place until the disaster recovery plan is put into motion.

Most DR solutions automatically back up and replicate critical workloads to multiple sites to strengthen and speed up the recovery process. With the advances in virtualization and replication technologies, modern companies have many DR capabilities at their disposal. Choosing the right one involves finding the balance between price, technology and the provider’s ability to cater to your own needs.

5. Incident response team


When a disaster strikes, all teams get involved. To effectively carry out a disaster recovery plan, you should name specific people to handle different recovery activities. This is key to ensuring that all the tasks will be completed as efficiently as possible.

The activities of the incident response team will vary and they should be defined within DR guidelines and procedure documents. Some of these include communicating with employees and external media, monitoring the systems, system setup and recovery operations.

As with all the other guidelines and procedures, details about the incident response team should include:

  • The action to complete
  • Job role of the person responsible for completing the action
  • Name and contact details of the person responsible
  • Timeframe in which the activity should be completed
  • Steps that more closely describe the activity

The incident response team will involve multiple departments, from technicians to senior management, each of which may have an important role in minimizing the effects of a disaster.

6. Disaster recovery services

Recovering complex IT systems may require massive manpower, hardware resources, and technical knowledge. Yet many of these can be supplemented by third-party resources and cloud solutions. Cloud-based resources are particularly handy for shifting parts of the infrastructure to remote servers, which can improve security and keep costs under control.

In companies where not all workloads are suitable for public cloud, a balanced distribution between on-site and cloud servers is a cost-effective way to configure infrastructure. Similarly, a hybrid approach to an IT disaster recovery plan is ideal for companies with advanced recovery needs.

A particularly convenient option is Disaster-Recovery-as-a-Service (DRaaS), which offers greater flexibility to teams operating within a limited DR budget. DRaaS allows access to critical infrastructure and backup resources at an affordable price point. It can also be used in both virtualized and dedicated environments, which makes it suitable for companies of any size and any infrastructure need.

7. Maintenance and testing activities


Once created, a disaster recovery plan needs to be reviewed and tested on a regular basis. This is the only way to ensure that it remains effective in the long term and that it can be applied in any scenario.

While a large share of modern businesses now have a disaster recovery plan in place, many of these plans are outdated and not aligned with the company’s current needs. This is why the plan needs to be updated to reflect any organizational or staff changes, especially in companies that grow at a rapid pace.

Also, all the critical applications and procedures should be regularly tested and monitored to ensure they are disaster-ready. This is best achieved by assigning specific tasks to the defined disaster recovery teams and training employees on disaster recovery best practices.
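
A simple way to keep reviews and tests from slipping is to track when each one was last carried out. The Python sketch below flags overdue activities; the activity names, the intervals, and the dates are assumptions chosen for illustration, so substitute whatever cadence your own plan defines.

```python
from datetime import date, timedelta

# Hypothetical schedule: how often each activity should be repeated.
SCHEDULE = {
    "plan review": timedelta(days=180),
    "failover test": timedelta(days=90),
}


def overdue(last_done, today):
    """Return activities whose last run is older than the allowed interval."""
    return [
        name
        for name, max_age in SCHEDULE.items()
        if today - last_done[name] > max_age
    ]


# Example dates, made up for the sketch.
last_done = {
    "plan review": date(2023, 9, 1),
    "failover test": date(2024, 5, 1),
}

print(overdue(last_done, today=date(2024, 6, 1)))  # ['plan review']
```

A check like this can run on a schedule and notify the team that owns each activity.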

Closing thoughts on disaster recovery procedures

Given the dynamics of today’s business, occasional disruptions seem inevitable no matter the company size. The major disasters we’ve seen recently only heighten the sense of uncertainty and the need to protect critical data and applications.

While a disaster recovery plan checklist may have many goals, one of its greatest values is its ability to reassure company staff that they can handle any scenario. The suggestions given above are intended to guide your company along this path.


Categories: disaster recovery, BCDR plan