The growing dependence on public cloud platforms has made many organizations overlook the value of a BCDR capability. But what if the internet itself goes down?
Internet outages are becoming increasingly frequent. There have been incidents where DDoS attacks targeting DNS services left many corporations inaccessible for hours on end. Without connectivity, an organization can do very little, even with the state-of-the-art capabilities of a public cloud solution.
Organizations must develop a Business Continuity and Disaster Recovery solution that supports their cloud and MSP services.
Outlining a BCDR Strategy
Any BCDR solution that addresses internet outages must factor in the following considerations:
- Recovering IT resources
  - What are the mission-critical IT components (voice, video)?
- Recovering production data
  - Do external vendors have access to production data?
- Recovering access to compute resources and data
  - What is the purpose: internal (employees) or external (customers)?
- Recovering prioritized communication assets
  - Incoming and outgoing voice calls, especially in call center facilities
- Recovering intra-company communication
  - How will voice and video data be transferred between sites?
Calibrating the impact of Natural Hazards
Data centers burned to ashes or washed away by a cyclonic storm, along with other high-intensity natural hazards, usually sit at the worst-case end of the spectrum of business disruptions.
Apart from these catastrophes, there is a whole range of disasters that can still affect operations. Businesses can employ many strategies to keep production up and running, such as:
- Redundant power supplies, including UPS and generators
- Fire suppression systems and sprinklers
Many often overlook the fact that in the event of a devastating natural disaster that stalls operations, people – be it customers, clients, public or stakeholders – are quite forthcoming in extending solidarity and supporting the affected enterprise through the crisis.
Nevertheless, customers, stakeholders and other interest groups are more concerned about the next category of hazards, which are more likely to occur and can be just as damaging. For instance:
- Faulty Software Updates in a Bank’s Mainframe
- Customer Database Corruption
- Insurance Company that Can’t Process Claims
- Airline Company that Can’t Accept Online Bookings
Companies can save a lot of money by dealing decisively with such incidents. Revenue losses can be curbed significantly when enterprises have a plan in place for recovering specific groups of applications and systems.
Restoring Compute Resources & Data
Even public clouds go down. When they do, they can indirectly affect businesses whose MSPs subscribe to the affected cloud service. Enterprises should choose cloud service providers that have backup locations in place.
Since current data is crucial for any business, the cloud service solution must also be able to fail over almost instantaneously.
Restoring Access
Access to critical resources is crucial during a disaster. If customer access to primary systems is disrupted, the DR systems that take over must be identical to the primary systems, and customers must be directed to them. This can be achieved by pointing public DNS entries at the DR servers.
Consider the following scenario:
A company outsources the hosting of its website to a third-party cloud service provider whose facility is located at the other end of the globe. IT teams would look to transfer all systems to this location in the event of a natural or man-made disaster. Production data has also been replicated there.
IT teams now have three options for transferring DNS entries to the new IP addresses at the DR location.
- The company has its own DNS hosting provider – IT teams log into their hosting provider account and enter the new IP addresses. Every DNS response also states how long a device may cache it. This period, known as the time to live (TTL), should be short so that cached records expire quickly and customers are directed to the new IP addresses without a long delay.
- Create a Cloud WAF or DDoS protection service account – Log in and enter the new IP addresses. An advantage of this option is that the new IP addresses go live instantly, because public DNS keeps pointing at the protection service and only its origin setting changes. The new addresses are also further secured by the WAF or DDoS protection.
- Global load balancer – IT teams use a DNS host that constantly keeps tabs on the main and alternate servers and automatically switches to the alternate server when access to the main server is interrupted.
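The TTL trade-off and the load balancer behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the `Endpoint` type, both function names, the health-check interval and the IP addresses (taken from the ranges reserved for documentation) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A site that public DNS can point to (hypothetical type for this sketch)."""
    name: str
    ip: str
    healthy: bool  # result of the most recent health check


def choose_target(primary: Endpoint, alternate: Endpoint) -> Endpoint:
    """Pick the site DNS should resolve to, preferring the primary."""
    if primary.healthy:
        return primary
    if alternate.healthy:
        return alternate
    # Neither site passes its health check: leave the record on the primary
    # instead of flapping between two unavailable sites.
    return primary


def worst_case_cutover_seconds(ttl: int, check_interval: int, failures_needed: int) -> int:
    """Rough upper bound on how long a client may keep using the old address.

    Assumes the outage is declared after `failures_needed` consecutive failed
    checks, and that a resolver cached the old record just before the switch.
    """
    detection = check_interval * failures_needed
    return detection + ttl


primary = Endpoint("production", "192.0.2.10", healthy=False)
alternate = Endpoint("dr-site", "198.51.100.10", healthy=True)

print(choose_target(primary, alternate).ip)
print(worst_case_cutover_seconds(ttl=60, check_interval=10, failures_needed=3))
print(worst_case_cutover_seconds(ttl=3600, check_interval=10, failures_needed=3))
```

With these assumed figures, a 60-second TTL bounds the cutover at about 90 seconds, while a one-hour TTL stretches it to 3630 seconds. Resolvers that do not honor TTLs can still take longer, so the figure is a planning estimate rather than a guarantee.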
It is also good practice for enterprises to have more than one DNS provider option. However, in such a scenario, IT teams would have to switch the entries for all the DNS vendors when disaster strikes. Companies must also ensure that the facilities of their DNS vendors are spread across geographical locations so that the same hazard doesn’t impact different DNS services simultaneously.
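Switching the entries at every DNS vendor is easy to get wrong under pressure, so it is a natural candidate for a script. The sketch below is only an illustration: `InMemoryDnsProvider` is a hypothetical stand-in for whatever client each real vendor offers, and the hostname and IP address are placeholders.

```python
from typing import Dict, List, Tuple


class InMemoryDnsProvider:
    """Stand-in for a real DNS vendor client (hypothetical, for illustration)."""

    def __init__(self, name: str, reachable: bool = True) -> None:
        self.name = name
        self.reachable = reachable
        self.records: Dict[str, Tuple[str, int]] = {}

    def set_a_record(self, hostname: str, ip: str, ttl: int) -> None:
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.records[hostname] = (ip, ttl)


def switch_all_providers(
    providers: List[InMemoryDnsProvider], hostname: str, ip: str, ttl: int = 60
) -> Dict[str, bool]:
    """Apply the same record change at every provider and report each outcome."""
    results: Dict[str, bool] = {}
    for provider in providers:
        try:
            provider.set_a_record(hostname, ip, ttl)
            results[provider.name] = True
        except ConnectionError:
            # Keep going: one unreachable vendor must not block the others.
            results[provider.name] = False
    return results


providers = [
    InMemoryDnsProvider("vendor-a"),
    InMemoryDnsProvider("vendor-b", reachable=False),
]
outcome = switch_all_providers(providers, "www.example.com", "198.51.100.10")
pending = [name for name, ok in outcome.items() if not ok]
print(pending)
```

Reporting the vendors that could not be updated matters as much as the update itself, since a provider still serving the old address will keep sending some customers to the failed site.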
Mission Critical Information
Many processes require crucial data that can be sourced either internally or externally. These processes include checking share prices, outsourcing insurance claim processing to third-party vendors, connecting with banks and financial institutions for payment processing and so on.
While executing such tasks, the alternate servers should be able to communicate with external sources. Sometimes, access to external servers is interrupted by a circuit malfunction. In such cases, it makes sense to have an alternate route from the main servers to the external servers. This way, unnecessary failovers to the alternate servers can be avoided.
Redundant circuits from the alternate location to external sources aren’t required in an active-standby setup. The backup circuits come into play only when multiple failures occur at the same time. While this is a very rare possibility, IT teams must still consider the specific risks associated with their enterprise’s operational environment.
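The reasoning above amounts to a small decision rule: use the backup circuit from the main site when only the primary circuit has failed, and fail over only when the main site itself is down or has lost both circuits. A minimal sketch follows, with hypothetical names throughout.

```python
from enum import Enum


class Action(Enum):
    USE_PRIMARY_CIRCUIT = "use primary circuit"
    USE_BACKUP_CIRCUIT = "use backup circuit from the main site"
    FAIL_OVER = "fail over to the alternate site"


def next_action(main_site_up: bool, primary_circuit_up: bool, backup_circuit_up: bool) -> Action:
    """Decide how the main site should reach an external data source."""
    if not main_site_up:
        return Action.FAIL_OVER
    if primary_circuit_up:
        return Action.USE_PRIMARY_CIRCUIT
    if backup_circuit_up:
        # A circuit fault alone does not justify moving production.
        return Action.USE_BACKUP_CIRCUIT
    # Both circuits are down: the main site can no longer reach the source.
    return Action.FAIL_OVER


print(next_action(main_site_up=True, primary_circuit_up=False, backup_circuit_up=True).value)
```

In this sketch a failed circuit on its own never triggers a site failover, which is the unnecessary switch the alternate route is meant to prevent.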
Recovering Intra-Company Communications
Information sharing within the organization via email, messaging, telephone and video calls varies from company to company. Some rely heavily on broadband connections while others set up VPN tunnels between locations.
Companies tend to move their internal communications to a cloud platform to save on costs. Building an in-house standby setup for internal communications might not be financially feasible. However, workarounds are still possible, such as using cell phones for calls and treating videoconferencing as non-essential during a crisis.
While using cell phones for making calls, enterprises will have to create a list of numbers for each branch. However, inbound calling might still pose a few constraints if it involves the general public.
Companies can also set up analog phones and make the numbers available to the public or customers. The phone service provider can then route calls from the public numbers to these analog lines during a business disruption.
Designing a recovery strategy for inbound calls gets tricky in larger scale operations, for instance, in the case of a call center. In such a case, calls would have to be diverted to an alternate location with sufficient functionality.
Maintenance
Full-fledged testing drills that take place only once a year run the risk of losing relevance. IT teams can alternatively scale down the scope of their reviews to target specific segments.
For instance, IT teams can check whether:
- Applications are still operational when production cloud services are moved from one public cloud zone to another
- Systems seamlessly switch to the alternate network connection
It’s good practice to test all redundant infrastructures periodically.
Many prefer a consolidated approach where DR capabilities are tested against a facility wide disruption such as the complete outage of a data center or a cloud availability zone.
IT teams should ideally balance the two approaches, starting with smaller tests that target specific components and then moving on to simulate a real disaster scenario.
Simulated Exercises
Simulating large-scale business disruptions requires extensive research and can be time-consuming. IT teams must take recent incidents of the following kinds into consideration while planning their simulated tests:
- DDoS attacks against a major service vendor
- DNS Server Access Interruptions
- Cloud Storage Access Interruptions
- No Connectivity with Cloud Availability Zone or Data Center
- Targeted Internet Security Breaches
Automation
Testing procedures will evolve with each iteration, and IT teams will get a clearer picture of the segments that need to be automated, for instance:
- Transferring hundreds of virtual machines between data centers
- Activating multiple VPN lines simultaneously
Such activities require automation to cut down time and error.
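As an illustration of what such automation might look like, the sketch below moves a large batch of virtual machines a few at a time and records the outcome for each one. It is a sketch under stated assumptions: `migrate_all` and its parameters are hypothetical, and `migrate_one` stands for whatever call the hypervisor or cloud platform actually provides.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def migrate_all(
    vm_ids: Iterable[str],
    migrate_one: Callable[[str], None],
    max_parallel: int = 8,
) -> Dict[str, str]:
    """Run migrate_one for every VM, a few at a time, and collect outcomes."""
    vm_list = list(vm_ids)

    def attempt(vm_id: str) -> str:
        try:
            migrate_one(vm_id)
            return "ok"
        except Exception as exc:
            # Record the failure and carry on with the remaining machines.
            return f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        outcomes = list(pool.map(attempt, vm_list))
    return dict(zip(vm_list, outcomes))


def fake_migrate(vm_id: str) -> None:
    """Placeholder for a real migration call; fails for one VM on purpose."""
    if vm_id == "vm-013":
        raise RuntimeError("target host out of capacity")


report = migrate_all((f"vm-{n:03d}" for n in range(200)), fake_migrate)
failed = [vm for vm, status in report.items() if status != "ok"]
print(failed)
```

Capping the number of parallel transfers keeps the replication link from being saturated, and the per-machine report gives the team a short list to retry by hand instead of an all-or-nothing result.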
While designing the BCDR capability, management teams must get directly involved and make an informed decision about when the plans should be triggered. Alternating frequently between primary and alternate facilities at the slightest pretext would do more harm than good.
Protocols for backing up connections and adding redundant traffic routes are quite resilient as they have been evolving over decades and can be used effectively.
Changes can also be implemented globally to save time during a disaster. For example,
- Deploying a global NAT rule so that the DR server mirrors the production server
- Activating multiple standby VPN lines
Conclusion
Disaster Recovery plans lose their relevance and value if they are not up to date when a disaster strikes. Plans and inventories must be updated automatically as and when there is a new addition to the infrastructure, such as a remote location or virtual server. Not all functions need to be recovered during a business disruption. Mission critical systems must be prioritized for recovery.
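Both points, keeping the plan in step with the infrastructure and recovering mission-critical systems first, lend themselves to a simple automated check. The sketch below is illustrative only: the asset names and the tier numbering (1 being the most critical) are assumptions made for the example.

```python
from typing import Dict, List, Set


def missing_from_plan(inventory: Set[str], plan: Dict[str, int]) -> List[str]:
    """Assets present in the live inventory but absent from the DR plan."""
    return sorted(inventory - set(plan))


def recovery_order(plan: Dict[str, int], max_tier: int) -> List[str]:
    """Systems to recover, most critical first, skipping tiers above max_tier."""
    selected = [name for name, tier in plan.items() if tier <= max_tier]
    return sorted(selected, key=lambda name: (plan[name], name))


# Hypothetical DR plan: asset name mapped to recovery tier (1 = mission critical).
plan = {"payments": 1, "call-center-voice": 1, "email": 2, "intranet-wiki": 3}

# Hypothetical live inventory, including a newly added remote location.
inventory = set(plan) | {"branch-vpn-07"}

print(missing_from_plan(inventory, plan))
print(recovery_order(plan, max_tier=2))
```

Running a check like this whenever the inventory changes flags new assets that the plan does not yet cover, and the tier cut-off makes explicit which functions are deliberately left unrecovered during a disruption.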