Completed in August 2014; duration: 9 months

Executive Summary

The Client was required by the regulator to set up a Disaster Recovery (DR) site, so that the impact on customers is limited in case of a Primary site outage.

My roles

Project manager:

  • Communication with key stakeholders (including Executives)
  • Project planning, including risk management
  • Team management
  • Contract negotiations with Vendors
  • Preparing and approving budgets

System administrator:

  • Network design & configuration
  • Firewall configuration

DevOps engineer:

  • Configuration management automation for network devices
  • Site-awareness automation on routers (IP SLA / EEM scripts)

Elaboration phase

We approached the client to understand high-level requirements and possible project constraints.

The existing infrastructure was implemented using the following technology stack:

  • VMware virtualization
  • HP Blade
  • NetApp NAS
  • A few stand-alone servers
  • Application stack: .Net & MS SQL

High-level requirements were:

  1. Ability to resume operations with minimum impact on data integrity
  2. Ability to resume operations with minimum possible down-time

Project constraints:

  • Client used a legacy application with a monolithic architecture

After reviewing the existing setup, additional pain points were identified:

  1. Performance issues with current setup
  2. Lack of resources (Compute & Storage)
  3. Aging hardware & software
  4. Environment management not adhering to best practices
  5. Application deployment not adhering to best practices
  6. Existing Service Provider for Primary site location not efficient
  7. Unreliable DNS-RR implementation

Project plan and scope definition

The following items were put in scope for our team:

  • Assess solution space
  • Prepare high-level budget estimates
  • Provide recommendations for client
  • Proof-of-Concept for approved solution
  • Implement solution approved by client:
    • Identify, engage and sign service provider for replacement of Primary location
    • Identify, engage and sign service provider for DR location
    • Engage and sign hardware and software vendors
    • Develop and implement environment management according to best practices
    • Acquire and configure all equipment and software for DR site
    • Install hardware on to-be-Primary site
  • Transfer users from the existing site to the to-be-Primary site
  • Perform DR simulation and transfer users from Primary site to DR site for 1 week and then back to Primary

The following items were in the scope of the Client's application vendor:

  • Application setup on DR site
  • Data transfer from Primary site

Solution space assessment

We identified the following options:

1. Partial upgrade of the old site and converged (FlexPod) hardware for the DR site

Rejected by the Client due to the following considerations:

  • While this option doesn’t provide a considerable performance improvement, it also doesn’t provide any savings, because more (older) CPUs require more licenses.
  • Additionally, this option results in different hardware configurations at the two sites, which would double the budget for implementing environment management best practices.

On the bright side, keeping the same software/hardware stack for the DR site wouldn’t require an additional learning curve.

2. Converged (FlexPod) hardware at both sites

Accepted by the Client due to the following benefits:

  • Matching all requirements
  • Addressing all pain points
  • Familiar software/hardware stack

While not the cheapest, this solution serves as an enabler for further business growth and provides a coherent approach to environment management.

3. Migration to cloud

Rejected by the Client based on the following constraints:

  • Regulatory limitations
  • Monolithic application architecture
  • Requires additional training investment

Vendors & Service Providers selection

Hardware/Software Vendors

Both sites implement an identical hardware setup, so that the switch-over process can become a regularly practiced routine without impact on customers.

  • Cisco UCS blades (N+1 setup)
  • NetApp NAS
  • Cisco Switch stack
  • Cisco routers/firewalls
  • VMware

Service Providers

  • Tier 1 internet provider (hosting load-balancers)
  • Tier 3 DataCenter provider

Design choices

Active+Active vs Active+Warm vs Active+Cold

The Active+Cold design was chosen for the following reasons:

  • The existing deployment process doesn’t ensure that identical application versions are installed at both sites
  • The existing environment management process doesn’t ensure identical OS configuration
  • Licensing cost overhead related to maintaining running VMs at the second site

User redirection between sites

BGP was chosen over a DNS-RR implementation due to higher reliability; a short illustration of the DNS caching issue follows the two lists below.

DNS-RR drawbacks:

  • Request caching between customers and authoritative servers
  • Up to a few hours to propagate changes
  • Longer downtime for VPN connections

BGP drawbacks:

  • 3 minute convergence time
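
As a quick illustration (not part of the project tooling), the Python sketch below inspects the answer a resolver returns for a round-robin name; the zone name is a placeholder and it relies on the third-party dnspython package. After a failover changes the record, clients and intermediate resolvers keep using their cached answer until the TTL expires, and some ignore the TTL entirely, which is what makes DNS-RR failover slow and unpredictable.

```python
# Illustration only: show the A records and TTL a resolver reports for a
# round-robin name. "app.example.com" is a placeholder, not the project's zone.
import dns.resolver  # third-party package: dnspython

answers = dns.resolver.resolve("app.example.com", "A")
print(f"TTL reported by the resolver: {answers.rrset.ttl} seconds")
for record in answers:
    # Until this TTL expires (or longer, for resolvers that ignore it),
    # clients keep connecting to whatever address they cached.
    print("A record:", record.address)
```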

Data replication between DataCenters

NetApp SnapMirror was chosen over native SQL replication.

SQL-replication drawbacks:

  • Requires licenses to maintain the recipient server
  • Doesn’t address VM image replication

SnapMirror replication drawbacks:

  • None identified

Application switching between sites

VMware SRM was chosen over a custom-developed script.

Custom-developed script drawbacks:

  • Requires a lot of effort
  • Error prone

VMware SRM drawbacks:

  • Associated licenses

SRM: Customizing IP properties OR not?

While it complicates the network design, we decided not to change the VMs’ IP addresses (an SRM capability) during migration between sites. This choice was dictated by the application design, which uses IPs (instead of names) for accessing resources, and by the fact that those IPs were spread across code, configuration files and the database.

SRM: Automate switching-over?

Considering the possible business impact of an unwanted switch-over (~1 hour of downtime with inconsistent data) triggered by a false-positive event, we decided that triggering a switch-over should remain a human decision.

Environment management automation

Desired State Configuration (DSC) from the Windows Management Framework (WMF) was chosen over System Center Configuration Manager (SCCM).

SCCM drawbacks:

  • Lack of control when implementing basic workflow functionality
  • Requires separate management licenses

DSC drawbacks:

  • Requires additional training investment
  • At the time it didn’t have 100% resource coverage, thus requiring the development of additional custom modules

Backup

Database backup was chosen over NetApp snapshots of the full environment. We designed the environment so that VMs are as stateless as possible, with the only persistent state kept in the database. Proper environment management automation allowed us to rebuild all VMs in up to 2 hours using DSC policies; a minimal backup sketch follows the drawback lists below.

NetApp snapshots of full environment drawbacks:

  • NetApp disk space is expensive

Database backup drawbacks:

  • Requires a longer recovery window
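
For reference, a minimal sketch of a full database backup driven from Python is shown below. The server name, database name and backup path are placeholders, not values from the project, and the actual backup tooling we used is not covered by this write-up.

```python
# Minimal full-backup sketch. Placeholders: "sql-primary-01", "AppDB" and the
# backup share do not come from the project. BACKUP DATABASE cannot run inside
# a transaction, so the connection is opened with autocommit enabled.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-primary-01;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()
cursor.execute(
    r"BACKUP DATABASE AppDB "
    r"TO DISK = N'\\backup01\sql\AppDB_full.bak' "
    r"WITH INIT, COMPRESSION, CHECKSUM;"
)
# Drain the informational result sets so the backup completes before closing.
while cursor.nextset():
    pass
conn.close()
```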

Challenges

Configuration Management for network devices

Considering that all equipment (including network) was redundant, it was clear long before the implementation phase that managing 4 devices of each type (router/firewall/switch) would become a challenge. For example, we had to be sure that all 4 routers implement an identical configuration (differing only in the parts concerning location- or device-specifics). It was also clear that managing all business functionality (VPNs, NAT, BGP, firewall, etc.) in the original configuration format (a single plain-text file) would become a pain, as configuration size usually doubles within a year as the environment stabilizes. All this called for:

  • automation and
  • logical composition

I developed a set of scripts which allows the use of location- and device-specific variables and splits the configuration into logical building blocks, so that adding new functionality during the operation stage becomes much easier; a simplified sketch of the approach is shown after the list below. When executed, the script pulls together all logical blocks and substitutes variables with location/device-specific values to assemble a single configuration file. Those scripts were also an ideal way to re-use configuration best practices. Storing logical configuration sections in a version control system brought benefits such as:

  • transparent Change Management process
  • ability to rebuild configuration to any version
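
The scripts themselves are not reproduced here; the Python sketch below is a simplified illustration of the approach, with made-up block names, device names and variables. Each logical block is a small template kept under version control, and the assembler substitutes per-device values to emit one flat configuration file per device.

```python
# Simplified illustration of the configuration assembler (names and variables
# below are made up). Each logical building block is a $-style template file;
# the script concatenates the blocks and fills in device/site-specific values.
from pathlib import Path
from string import Template

BLOCKS = ["base.tmpl", "vpn.tmpl", "nat.tmpl", "bgp.tmpl", "firewall.tmpl"]

DEVICES = {
    "rtr-primary-01": {"site": "primary", "loopback": "10.0.0.1", "bgp_asn": "64512"},
    "rtr-dr-01":      {"site": "dr",      "loopback": "10.0.1.1", "bgp_asn": "64512"},
}

def assemble(device: str, variables: dict, blocks_dir: Path = Path("blocks")) -> str:
    """Concatenate all logical blocks and substitute location/device-specific values."""
    sections = []
    for block in BLOCKS:
        template = Template((blocks_dir / block).read_text())
        sections.append(template.substitute(variables, hostname=device))
    return "\n!\n".join(sections) + "\n"

if __name__ == "__main__":
    for device, variables in DEVICES.items():
        Path(f"{device}.cfg").write_text(assemble(device, variables))
        print(f"assembled {device}.cfg")
```

A block such as bgp.tmpl would then contain lines like `router bgp $bgp_asn`, and the generated per-device .cfg files are what would be pushed to the devices.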

Site-aware network

To make the site switch-over a less human-dependent (and therefore less error-prone) process, we had to make our network aware of exactly which site is Active at any given time. When implementing solutions to achieve this goal, one of our primary focuses was to avoid excessive complexity.

First of all, moving VMs without SRM changing their IPs introduced additional challenges in switching traffic between sites on the load-balancer side. Specifically, we were unable to use a classic IPsec setup. To work around this limitation we had to implement OSPF over GRE over IPsec between the routers and the load-balancers. This allowed us to avoid another manual step in the SRM workflow.

Another challenge was for the routers to understand whether the site they are installed at is currently Active. I achieved this by using a combination of IP SLA, object tracking and OSPF route-maps, all orchestrated through Cisco’s Embedded Event Manager (EEM) scripts.

Performing production switch-over

At the final stage of implementation we performed a graceful switch-over (properly shutting down VMs at one site, synchronizing storage, and booting VMs in order at the other site) in under 1 hour at the push of a button. The most time-consuming part of the workflow was the actual VM startup (~25 min), followed by VM shutdown (~15 min).

Lessons learned

Test after major change

When deploying the load-balancer we hit a kernel bug, resulting in a huge performance drop once a few dozen customers accessed the site. As the Service Provider gave us an OS that differed from our test lab only in its minor version, we assumed it would work without any noticeable issues. Wrong. Considering the scope of the change and its impact, such assumptions shouldn’t go unverified. Even if the Client is pushing you, don’t cut a corner here; explain how skipping this step may impact the business. Even if you have no idea about the application internals, do load testing as a bare minimum.
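
Even a throwaway script can reveal this class of problem. The Python sketch below (placeholder URL and concurrency levels, not the tooling used on the project) ramps up concurrent requests and prints how latency degrades.

```python
# Throwaway load-test sketch: ramp up concurrent requests against one endpoint
# and watch average/max latency. The URL and numbers are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "https://app.example.com/"  # placeholder, not the project's site

def fetch(_):
    started = time.monotonic()
    with urlopen(URL, timeout=30) as response:
        response.read()
    return time.monotonic() - started

for workers in (1, 10, 50):  # a few dozen concurrent users is where we got hit
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(fetch, range(workers * 5)))
    print(f"{workers:>3} concurrent: avg {sum(latencies) / len(latencies):.2f}s, "
          f"max {max(latencies):.2f}s")
```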