Completed in August 2014, duration: 9 months
- Executive Summary
- My roles
- Elaboration phase
- Project plan and scope definition
- Solution space assessment
- Vendors & Service Providers selection
- Design choices
- Challenges
- Performing production switch-over
- Lessons learned
Executive Summary
The Client was required by the regulator to set up a Disaster Recovery (DR) site, so that the impact on customers is limited in case of a Primary site outage.
My roles
Project manager:
- Communication with key stakeholders (including Executives)
- Project planning, including risk management
- Team management
- Contract negotiations with Vendors
- Preparing and approving budgets
System administrator:
- Network design & configuration
- Firewall configuration
DevOps engineer:
- Design and implementation of Configuration Management process for network devices
Elaboration phase
We approached the Client to understand high-level requirements and possible project constraints.
The existing infrastructure was implemented using the following technology stack:
- VMware virtualization
- HP Blade
- NetApp NAS
- A few stand-alone servers
- Application stack: .Net & MS SQL
High-level requirements were:
- Ability to resume operations with minimum impact on data integrity
- Ability to resume operations with minimum possible down-time
Project constraints:
- The Client used a legacy application with a monolithic architecture
After reviewing the existing setup, additional pain-points were identified:
- Performance issues with current setup
- Lack of resources (Compute & Storage)
- Aging hardware & software
- Environment management not adhering to best practices
- Application deployment not adhering to best practices
- Existing Service Provider for the Primary site location was not efficient
- Unreliable DNS-RR implementation
Project plan and scope definition
The following items were put in scope for our team:
- Assess solution space
- Prepare high-level budget estimates
- Provide recommendations for client
- Proof-of-Concept for approved solution
- Implement solution approved by client:
- Identify, engage and sign service provider for replacement of Primary location
- Identify, engage and sign service provider for DR location
- Engage and sign hardware and software vendors
- Develop and implement environment management according to best practices
- Acquire and configure all equipment and software for DR site
- Install hardware at the to-be-Primary site
- Transfer users from the existing site to the to-be-Primary site
- Perform DR simulation and transfer users from Primary site to DR site for 1 week and then back to Primary
The following items were in scope for the Client's application vendor:
- Application setup on DR site
- Data transfer from Primary site
Solution space assessment
We identified the following options:
1. Partial upgrade of the old site and converged (FlexPod) hardware for the DR site
Rejected by the Client due to the following considerations:
- This option doesn't provide a considerable performance improvement, and it doesn't provide any savings either, because more (older) CPUs require more licenses.
- Additionally, this option results in different hardware configurations at the two sites, doubling the budget for implementing environment management best practices.
On the bright side, keeping the same software/hardware stack for the DR site doesn't require an additional learning curve.
2. Converged (FlexPod) hardware at both sites
Accepted by the Client due to the following benefits:
- Matching all requirements
- Addressing all pain-points
- Familiar software/hardware stack
While not the cheapest, this solution serves as an enabler for further business growth and provides a coherent approach to environment management.
3. Migration to cloud
Rejected by the Client due to the following constraints:
- Regulatory limitations
- Monolithic application architecture
- Requires additional training investment
Vendors & Service Providers selection
Hardware/Software Vendors
Both sites implement an identical hardware setup, so that the switch-over process can become a regularly practiced routine without impact on customers.
- Cisco UCS blades (N+1 setup)
- NetApp NAS
- Cisco Switch stack
- Cisco routers/firewalls
- VMware
Service Providers
- Tier 1 internet provider (hosting load-balancers)
- Tier 3 DataCenter provider
Design choices
Active+Active vs Active+Warm vs Active+Cold
The Active+Cold design was chosen for the following reasons:
- The existing deployment process doesn't ensure that identical application versions are installed at both sites
- The existing environment management process doesn't ensure an identical OS configuration
- Licensing cost overhead of maintaining running VMs at the second site
User redirection between sites
BGP was chosen over a DNS-RR implementation due to higher reliability; a small illustration of the DNS caching issue follows the comparison below.
DNS-RR drawbacks:
- Request caching between customers and authoritative servers
- Up to a few hours to propagate changes
- Longer down-time for VPN connections
BGP drawbacks:
- 3 minute convergence time
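As a minimal sketch of the caching drawback, the Python snippet below (assuming the third-party dnspython package; the domain name is a placeholder, not the Client's actual record) prints the TTL that bounds how long resolvers may keep serving a failed site's address.
```python
# Minimal sketch, assuming the third-party "dnspython" package
# (pip install dnspython). The domain below is a placeholder.
import dns.resolver

NAME = "example.com"  # would be the public DNS-RR record covering both sites

answer = dns.resolver.resolve(NAME, "A")
for record in answer:
    print(f"{NAME} -> {record.address}")

# Resolvers may cache this answer for up to TTL seconds (misbehaving caches
# even longer), so clients can keep hitting a failed site well after the
# record is changed, hence the hours-long propagation estimate.
print(f"advertised cache TTL: {answer.rrset.ttl} seconds")
```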
Data replication between DataCenters
SnapMirror was chosen over native SQL replication.
SQL replication drawbacks:
- Requires licenses for the recipient server
- Doesn’t address VM image replication
SnapMirror replication drawbacks:
- None identified
Application switching between sites
VMware SRM was chosen over a custom-developed script.
Custom-developed script drawbacks:
- Requires a lot of effort
- Error prone
VMware SRM drawbacks:
- Associated licenses
SRM: Customizing IP properties OR not?
While it complicates network design, we decided not to change VM IP addresses (an SRM feature) during migration between sites. This choice was dictated by the application design, which uses IPs (instead of names) for accessing resources, and by the fact that those IPs were spread across code, configuration files and the database.
SRM: Automate switching-over?
Considering the possible business impact of an unwanted switch-over (~1 hour of downtime with inconsistent data) triggered by a false-positive event, we decided to leave this as a human decision.
Environment management automation
Desired State Configuration (DSC) from Windows Management Framework (WMF) was chosen over System Center Configuration Manager (SCCM).
SCCM drawbacks:
- Lack of control when implementing basic workflow functionality
- Requires separate management licenses
DSC drawbacks:
- Requires additional training investment
- At the time it didn't have 100% coverage, requiring development of additional custom modules
Backup
Database backup was chosen over NetApp snapshots of the full environment. We designed the environment so that VMs are as stateless as possible, with the only persistence in the database. Proper environment management automation allowed us to rebuild all VMs within 2 hours using DSC policies.
NetApp snapshots of full environment drawbacks:
- NetApp disk space is expensive
Database backup drawbacks:
- Requires a longer recovery window
Challenges
Configuration Management for network devices
Considering that all equipment (including network) was redundant, it was clear long before the implementation phase that managing 4 devices of each type (router/firewall/switch) would become a challenge. For example, we had to be sure that all 4 routers implement an identical configuration (with differences only in the parts covering location- or device-specifics). It was also clear that managing all business functionality (VPNs, NATs, BGP, firewall rules, etc.) in the original configuration format (a single plain-text file) would become a pain, as configuration size usually doubles within a year while the environment stabilizes. All this called for:
- automation and
- logical composition
I developed a set of scripts which allows the use of location- and device-specific variables and splits the configuration into logical building blocks, so that adding new functionality during the operations stage becomes much easier. When executed, the script pulls together all logical blocks and substitutes variables with location/device-specific values to assemble a single configuration file (a minimal sketch of this approach follows the list below). Those scripts were also an ideal way to re-use configuration best practices. Storing logical configuration sections in a version control system brought benefits such as:
- transparent Change Management process
- ability to rebuild configuration to any version
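The original scripts are not reproduced here; the following is a minimal Python sketch of the same idea, assuming hypothetical block files under a blocks/ directory and an illustrative per-device variables table (the file names, device names and values are placeholders, not the actual configuration).
```python
from pathlib import Path
from string import Template

# Hypothetical logical building blocks, one file per functional area
# (base settings, BGP, VPN, firewall, ...), kept under version control.
BLOCKS = ["base.tmpl", "bgp.tmpl", "vpn.tmpl", "firewall.tmpl"]

# Illustrative location/device-specific values; the real project kept
# equivalents for every router/firewall/switch at each site.
DEVICE_VARS = {
    "pri-router-1": {"hostname": "pri-router-1", "site": "PRIMARY", "loopback": "10.255.0.1"},
    "dr-router-1":  {"hostname": "dr-router-1",  "site": "DR",      "loopback": "10.255.0.3"},
}

def assemble_config(device: str, blocks_dir: str = "blocks") -> str:
    """Pull all logical blocks together and substitute device-specific values."""
    rendered = []
    for name in BLOCKS:
        template = Template(Path(blocks_dir, name).read_text())
        # substitute() raises KeyError for a missing variable, surfacing
        # incomplete device definitions before anything reaches a device.
        rendered.append(template.substitute(DEVICE_VARS[device]))
    return "\n!\n".join(rendered)  # single plain-text configuration file

if __name__ == "__main__":
    print(assemble_config("dr-router-1"))
```
A block file would contain $hostname-style placeholders; because all four routers are rendered from the same blocks, any intentional difference has to live in the variables table, which is what keeps the configurations identical apart from the location- and device-specific parts.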
Site-aware network
To make the site switch-over a less human-dependent (and therefore less error-prone) process, we had to make our network aware of which site is Active at any given time. When implementing solutions to achieve this goal, one of our primary focuses was to avoid excessive complexity.
First of all, moving VMs without SRM changing their IPs introduced additional challenges in switching traffic between sites on the load-balancer side. Specifically, we were unable to use a classic IPSec setup. To work around this limitation we had to implement OSPF over GRE over IPSec between the routers and the load-balancers. This allowed us to avoid another manual step in the SRM workflow.
Another challenge was for the routers to understand whether the site they are installed at is currently Active. I achieved this using a combination of IP SLA, object tracking and OSPF route-maps, all orchestrated through Cisco's Embedded Event Manager (EEM) scripts; a simplified analogue of that tracking logic is sketched below.
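The real implementation lives entirely on the routers (IP SLA probes feeding tracked objects, with EEM applets adjusting OSPF route-maps), so no general-purpose code is involved; the Python sketch below is only an illustrative analogue of the tracking logic, with a made-up probe target and thresholds.
```python
import subprocess
import time

# Illustrative analogue of IP SLA + object tracking: probe a witness address
# that is only reachable while the local site holds the Active role, and
# react only after several consecutive failures so that a single lost probe
# cannot trigger a change (all values below are made up).
WITNESS_IP = "192.0.2.10"
FAIL_THRESHOLD = 3
PROBE_INTERVAL_SECONDS = 10

def probe(ip: str) -> bool:
    """One reachability probe (ICMP echo), analogous to a single IP SLA run."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

def track_active_state() -> None:
    failures = 0
    site_active = True
    while True:
        if probe(WITNESS_IP):
            failures, site_active = 0, True
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD and site_active:
                site_active = False
                # On the routers this is the point where an EEM applet would
                # adjust route-maps; here we only record the state change.
                print("Local site no longer considered Active")
        time.sleep(PROBE_INTERVAL_SECONDS)
```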
Performing production switch-over
At the final stage of implementation we performed a graceful switch-over (properly shutting down VMs at one site, synchronizing storage, and booting VMs at the other site in an orderly fashion) in under 1 hour with the push of a button. The most time-consuming part of the workflow was the actual VM startup (~25 min), followed by the VM shutdown (~15 min).
Lessons learned
Test after major change
When deploying the load-balancer we hit a kernel bug, resulting in a huge performance drop once a few dozen customers accessed the site. Since the Service Provider gave us an OS that differed from our test lab only in its minor version, we assumed it would work without any noticeable issues. Wrong. Considering the scope of the change and its impact, such assumptions shouldn't go unverified. Even if the Client is pushing you, don't cut corners here; explain how skipping this step may impact the business. Even if you have no idea about the application internals, do load testing as a bare minimum; a minimal sketch of such a smoke test follows.
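No specific tool was mandated here; as a hedged illustration, even a quick smoke test like the Python sketch below (the URL, concurrency and request counts are placeholders) would have exposed a performance drop at a few dozen concurrent users.
```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "https://example.com/health"  # placeholder; point at the new environment
CONCURRENCY = 50                    # roughly "a few dozen customers"
REQUESTS = 500

def fetch(_: int) -> float:
    """Issue one request and return its latency in seconds."""
    start = time.perf_counter()
    with urlopen(URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(fetch, range(REQUESTS)))
    # A large gap between the median and the 95th percentile under modest
    # concurrency is exactly the kind of regression the kernel bug produced.
    print(f"p50={latencies[len(latencies) // 2]:.3f}s "
          f"p95={latencies[int(len(latencies) * 0.95)]:.3f}s")
```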