Completed in August 2014, duration: 9 months
- Executive Summary
- My roles
- Elaboration phase
- Project plan and scope definition
- Solution space assessment
- Vendors & Service Providers selection
- Design choices
- Challenges
- Performing production switch-over
- Lessons learned
Executive Summary
The Client was required by the regulator to set up a Disaster Recovery (DR) site, so that the impact on customers is limited in case of a Primary site outage.
My roles
Project manager:
- Communication with key stakeholders (including Executives)
- Project planning, including risk management
- Team management
- Contract negotiations with Vendors
- Preparing and approving budgets
System administrator:
- Network design & configuration
- Firewall configuration
DevOps engineer:
- Design and implementation of Configuration Management process for network devices
Elaboration phase
We approached the Client to understand high-level requirements and possible project constraints.
The existing infrastructure was implemented using the following technology stack:
- VMware virtualization
- HP Blade
- NetApp NAS
- A few stand-alone servers
- Application stack: .Net & MS SQL
High-level requirements were:
- Ability to resume operations with minimum impact on data integrity
- Ability to resume operations with minimum possible down-time
Project constraints:
- The Client used a legacy application with a monolithic architecture
After reviewing the existing setup, additional pain-points were identified:
- Performance issues with current setup
- Lack of resources (Compute & Storage)
- Aging hardware & software
- Environment management not adhering to best practices
- Application deployment not adhering to best practices
- Existing Service Provider for the Primary site location was not efficient
- Unreliable DNS-RR implementation
Project plan and scope definition
The following items were put in scope for our team:
- Assess solution space
- Prepare high-level budget estimates
- Provide recommendations for client
- Proof-of-Concept for approved solution
- Implement solution approved by client:
- Identify, engage and sign service provider for replacement of Primary location
- Identify, engage and sign service provider for DR location
- Engage and sign hardware and software vendors
- Develop and implement environment management according to best practices
- Acquire and configure all equipment and software for DR site
- Install hardware at the to-be-Primary site
- Transfer users from the existing site to the to-be-Primary site
- Perform DR simulation and transfer users from Primary site to DR site for 1 week and then back to Primary
The following items were in scope for the Client's application vendor:
- Application setup on DR site
- Data transfer from Primary site
Solution space assessment
We identified the following options:
1. Partial upgrade of the old site and converged (FlexPod) hardware for the DR site
Rejected by the Client due to the following considerations:
- This option doesn't provide a considerable performance improvement, and it doesn't provide any savings either, because more (older) CPUs require more licenses.
- Additionally, this option results in different hardware configurations at the two sites, doubling the budget for implementing environment management best practices.
On the bright side, keeping the same software/hardware stack for the DR site doesn't require an additional learning curve.
2. Converged (FlexPod) hardware at both sites
Accepted by the Client due to the following benefits:
- Matching all requirements
- Addressing all pain-points
- Familiar software/hardware stack
While not the cheapest, this solution serves as an enabler for further business growth and provides a coherent approach to environment management.
3. Migration to cloud
Rejected by the Client due to the following constraints:
- Regulatory limitations
- Monolithic application architecture
- Requires additional training investment
Vendors & Service Providers selection
Hardware/Software Vendors
Both sites implement an identical hardware setup, so that the switch-over process can become a regularly practiced routine without impact on customers.
- Cisco UCS blades (N+1 setup)
- NetApp NAS
- Cisco Switch stack
- Cisco routers/firewalls
- VMware
Service Providers
- Tier 1 internet provider (hosting load-balancers)
- Tier 3 DataCenter provider
Design choices
Active+Active vs Active+Warm vs Active+Cold
The Active+Cold design was chosen for the following reasons:
- The existing deployment process doesn't ensure that identical application versions are installed at both sites
- The existing environment management process doesn't ensure an identical OS configuration
- Licensing cost overhead of maintaining running VMs at the second site
User redirection between sites
BGP was chosen over a DNS-RR implementation due to higher reliability; a small illustration of the DNS caching issue follows the comparison below.
DNS-RR drawbacks:
- Request caching between customers and authoritative servers
- Up to a few hours to propagate changes
- Longer down-time for VPN connections
BGP drawbacks:
- 3 minute convergence time
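As a minimal sketch of the caching drawback, the Python snippet below (assuming the third-party dnspython package; the domain name is a placeholder, not the Client's actual record) prints the TTL that bounds how long resolvers may keep serving a failed site's address.
```python
# Minimal sketch, assuming the third-party "dnspython" package
# (pip install dnspython). The domain below is a placeholder.
import dns.resolver

NAME = "example.com"  # would be the public DNS-RR record covering both sites

answer = dns.resolver.resolve(NAME, "A")
for record in answer:
    print(f"{NAME} -> {record.address}")

# Resolvers may cache this answer for up to TTL seconds (misbehaving caches
# even longer), so clients can keep hitting a failed site well after the
# record is changed, hence the hours-long propagation estimate.
print(f"advertised cache TTL: {answer.rrset.ttl} seconds")
```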
Data replication between DataCenters
SnapMirror was chosen over native SQL replication.
SQL replication drawbacks:
- Requires licenses for the recipient server
- Doesn’t address VM image replication
SnapMirror replication drawbacks:
- None identified
Application switching between sites
VMware SRM was chosen over a custom-developed script.
Custom-developed script drawbacks:
- Requires a lot of effort
- Error prone
VMware SRM drawbacks:
- Associated licenses
SRM: Customizing IP properties OR not?
While it complicates network design, we decided not to change VM IP addresses (an SRM feature) during migration between sites. This choice was dictated by the application design, which uses IPs (instead of names) for accessing resources, and by the fact that those IPs were spread across code, configuration files and the database.
SRM: Automate switching-over?
Considering the possible business impact of an unwanted switch-over (~1 hour of downtime with inconsistent data) triggered by a false-positive event, we decided to leave this as a human decision.
Environment management automation
Desired State Configuration (DSC) from Windows Management Framework (WMF) was chosen over System Center Configuration Manager (SCCM).
SCCM drawbacks:
- Lack of control when implementing basic workflow functionality
- Requires separate management licenses
DSC drawbacks:
- Requires additional training investment
- At the time it didn't have 100% coverage, requiring development of additional custom modules
Backup
Database backup was chosen over NetApp snapshots of the full environment. We designed the environment so that VMs are as stateless as possible, with the only persistence in the database. Proper environment management automation allowed us to rebuild all VMs within 2 hours using DSC policies.
NetApp snapshots of full environment drawbacks:
- NetApp disk space is expensive
Database backup drawbacks:
- Requires a longer recovery window
Challenges
Configuration Management for network devices
Considering that all equipment (including network) was redundant, it was clear long before the implementation phase that managing 4 devices of each type (router/firewall/switch) would become a challenge. For example, we had to be sure that all 4 routers implement an identical configuration (with differences only in the parts covering location- or device-specifics). It was also clear that managing all business functionality (VPNs, NATs, BGP, firewall rules, etc.) in the original configuration format (a single plain-text file) would become a pain, as configuration size usually doubles within a year while the environment stabilizes. All this called for:
- automation and
- logical composition
I developed a set of scripts which allows the use of location- and device-specific variables and splits the configuration into logical building blocks, so that adding new functionality during the operations stage becomes much easier. When executed, the script pulls together all logical blocks and substitutes variables with location/device-specific values to assemble a single configuration file (a minimal sketch of this approach follows the list below). Those scripts were also an ideal way to re-use configuration best practices. Storing logical configuration sections in a version control system brought benefits such as:
- transparent Change Management process
- ability to rebuild configuration to any version
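The original scripts are not reproduced here; the following is a minimal Python sketch of the same idea, assuming hypothetical block files under a blocks/ directory and an illustrative per-device variables table (the file names, device names and values are placeholders, not the actual configuration).
```python
from pathlib import Path
from string import Template

# Hypothetical logical building blocks, one file per functional area
# (base settings, BGP, VPN, firewall, ...), kept under version control.
BLOCKS = ["base.tmpl", "bgp.tmpl", "vpn.tmpl", "firewall.tmpl"]

# Illustrative location/device-specific values; the real project kept
# equivalents for every router/firewall/switch at each site.
DEVICE_VARS = {
    "pri-router-1": {"hostname": "pri-router-1", "site": "PRIMARY", "loopback": "10.255.0.1"},
    "dr-router-1":  {"hostname": "dr-router-1",  "site": "DR",      "loopback": "10.255.0.3"},
}

def assemble_config(device: str, blocks_dir: str = "blocks") -> str:
    """Pull all logical blocks together and substitute device-specific values."""
    rendered = []
    for name in BLOCKS:
        template = Template(Path(blocks_dir, name).read_text())
        # substitute() raises KeyError for a missing variable, surfacing
        # incomplete device definitions before anything reaches a device.
        rendered.append(template.substitute(DEVICE_VARS[device]))
    return "\n!\n".join(rendered)  # single plain-text configuration file

if __name__ == "__main__":
    print(assemble_config("dr-router-1"))
```
A block file would contain $hostname-style placeholders; because all four routers are rendered from the same blocks, any intentional difference has to live in the variables table, which is what keeps the configurations identical apart from the location- and device-specific parts.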
Site-aware network
To make the site switch-over a less human-dependent (and therefore less error-prone) process, we had to make our network aware of which site is Active at any given time. When implementing solutions to achieve this goal, one of our primary focuses was to avoid excessive complexity.
First of all, moving VMs without SRM changing their IPs introduced additional challenges in switching traffic between sites on the load-balancer side. Specifically, we were unable to use a classic IPSec setup. To work around this limitation we had to implement OSPF over GRE over IPSec between the routers and the load-balancers. This allowed us to avoid another manual step in the SRM workflow.
Another challenge was for the routers to understand whether the site they are installed at is currently Active. I achieved this using a combination of IP SLA, object tracking and OSPF route-maps, all orchestrated through Cisco's Embedded Event Manager (EEM) scripts; a simplified analogue of that tracking logic is sketched below.
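The real implementation lives entirely on the routers (IP SLA probes feeding tracked objects, with EEM applets adjusting OSPF route-maps), so no general-purpose code is involved; the Python sketch below is only an illustrative analogue of the tracking logic, with a made-up probe target and thresholds.
```python
import subprocess
import time

# Illustrative analogue of IP SLA + object tracking: probe a witness address
# that is only reachable while the local site holds the Active role, and
# react only after several consecutive failures so that a single lost probe
# cannot trigger a change (all values below are made up).
WITNESS_IP = "192.0.2.10"
FAIL_THRESHOLD = 3
PROBE_INTERVAL_SECONDS = 10

def probe(ip: str) -> bool:
    """One reachability probe (ICMP echo), analogous to a single IP SLA run."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

def track_active_state() -> None:
    failures = 0
    site_active = True
    while True:
        if probe(WITNESS_IP):
            failures, site_active = 0, True
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD and site_active:
                site_active = False
                # On the routers this is the point where an EEM applet would
                # adjust route-maps; here we only record the state change.
                print("Local site no longer considered Active")
        time.sleep(PROBE_INTERVAL_SECONDS)
```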
Performing production switch-over
At the final stage of implementation we performed a graceful switch-over (properly shutting down VMs at one site, synchronizing storage, and booting VMs at the other site in an orderly fashion) in under 1 hour with the push of a button. The most time-consuming part of the workflow was the actual VM startup (~25 min), followed by the VM shutdown (~15 min).
Lessons learned
Test after major change
When deploying the load-balancer we hit a kernel bug, resulting in a huge performance drop once a few dozen customers accessed the site. Since the Service Provider gave us an OS that differed from our test lab only in its minor version, we assumed it would work without any noticeable issues. Wrong. Considering the scope of the change and its impact, such assumptions shouldn't go unverified. Even if the Client is pushing you, don't cut corners here; explain how skipping this step may impact the business. Even if you have no idea about the application internals, do load testing as a bare minimum; a minimal sketch of such a smoke test follows.
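No specific tool was mandated here; as a hedged illustration, even a quick smoke test like the Python sketch below (the URL, concurrency and request counts are placeholders) would have exposed a performance drop at a few dozen concurrent users.
```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "https://example.com/health"  # placeholder; point at the new environment
CONCURRENCY = 50                    # roughly "a few dozen customers"
REQUESTS = 500

def fetch(_: int) -> float:
    """Issue one request and return its latency in seconds."""
    start = time.perf_counter()
    with urlopen(URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(fetch, range(REQUESTS)))
    # A large gap between the median and the 95th percentile under modest
    # concurrency is exactly the kind of regression the kernel bug produced.
    print(f"p50={latencies[len(latencies) // 2]:.3f}s "
          f"p95={latencies[int(len(latencies) * 0.95)]:.3f}s")
```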