Why we chose VMware Site Recovery Manager
As we grew, the need for our core computing resources to have a segregated, site-agnostic DR (Disaster Recovery) site became more apparent.
To mitigate the risk of a total site failure (a fibre cut to our Production Site, a natural disaster and so on), we needed the Production Server Room replicated, with the ability to bring up the replicated site quickly and efficiently. This was mandatory to ensure the continuation of business, financial and clinical care services.
We architected a solution to provide a replicated site in the easiest, most scalable and most robust way, for the day which we all hope never happens. Based on past technical design decisions, a replicated site was always deemed the best way to keep the business operating if the Production Site was unable to function. The previous solution used synchronous block-level replication over dark fibre between Hitachi SANs to provide the storage layer, together with stretched cluster technology.
However, there were some issues with this.
- The cost of dark fibre was prohibitive.
- The server hardware was more than five years old.
- The warranty period of the server infrastructure had expired.
- The licensing costs of the Hitachi SANs, together with the replication licence, were in excess of $55,000 per annum.
- We had not taken into account the looming capacity management issue created when the server infrastructure was split in half to accommodate the second site.
- Our total computing resources were split in half, and we were already having contention issues for CPU, memory and disk.
- The Microsoft stretched cluster, while cutting-edge technology at its inception, proved clumsy, expensive and prohibitively complex once implemented.
A new take on the problem was required. In February 2009, we undertook a capacity management exercise to identify the performance and growth requirements and the general design decisions needed to move forward. From the outset, we required a platform that could scale without downtime, and we needed to abstract our server systems using virtualisation technology, with a view to using VMware Site Recovery Manager.
As a precursor to using SRM to protect the Production Site, we undertook a 100% virtualisation project. It achieved quite a few milestones.
- Our infrastructure has been 100% virtualised (except for Nurse Call).
- Consolidation from 34 physical servers to 4 (currently running 160 virtual machines).
- Removed the need for Hitachi maintenance, a saving of $55,000 per year.
- Server infrastructure that can scale to meet future requirements.
- Implementation of a server infrastructure where Sundale as an organisation can see performance metrics in real time.
- Implementation of a server infrastructure where ICT can adhere to performance and uptime SLAs for internal and external customers when required.
- Reduced power usage/backup and cooling requirements in our Production Data centre by 66%.
- Implementation of a data centre model that is widely used by Fortune 500 companies.
- A successfully engineered and implemented Private cloud.
- With the next revisions of our data centre virtualisation engine, we were on track to use and provide public and internal cloud federation services.
- With our internal cloud successfully implemented, we were ready for a DR site (VMware Site Recovery Manager) to protect our Production Site over an asynchronous link, without the need for prohibitively expensive dark fibre between sites. Automated recovery plans and monthly testing of complete site failure without downtime would be possible.
In January 2010, we started planning the DR/BCP project in earnest.
With the major task of virtualising our entire server infrastructure complete, the next logical step was to leverage our significant investment in VMware, using further VMware software to streamline and automate datacentre operations, reduce labour costs and complexity, and improve general system health and user experience.
Hence the introduction of VMware Site Recovery Manager.
The requirement was for a one-for-one replicated server room at our DR Site, which is linked by 100 Mbit Telstra MWAN fibre. As part of the DR/BCP project, the decision was made to get Telstra managed routers for both the Production and DR Sites.
While planning for data change rates, we estimated that a replication lag of 3-5 hours would be experienced. The maximum data rate we could push across the 100 Mbit link was 6.9 MB per second, hence the 3-5 hour lag.
However, through a side project on WAN optimisation using Riverbed WAN optimisers, our link utilisation went from 100% to 30%, while the replication rate increased to a peak of 36 MB per second.
With the increased replication rate, we have decreased the replication lag to less than 15 minutes.
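To make the arithmetic behind those figures easier to follow, here is a minimal sketch in Python. It only uses the rates quoted above (6.9 MB per second before optimisation, a peak of 36 MB per second after) and the 15-minute schedule. The function names are made up for illustration, and it assumes decimal megabytes and a constant transfer rate, which a real link will not sustain.

```python
def mbit_to_mbyte_per_s(mbit_per_s: float) -> float:
    """Convert a link speed in Mbit/s to MB/s (8 bits per byte, no protocol overhead)."""
    return mbit_per_s / 8.0


def max_delta_mb(rate_mb_per_s: float, window_minutes: float) -> float:
    """Largest amount of changed data (MB) that can be shipped within one window."""
    return rate_mb_per_s * window_minutes * 60.0


def transfer_minutes(delta_mb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to ship delta_mb of changed data at a constant rate."""
    return delta_mb / rate_mb_per_s / 60.0


if __name__ == "__main__":
    print("Raw 100 Mbit link:", mbit_to_mbyte_per_s(100), "MB/s")
    for label, rate in (("before optimisation", 6.9), ("after optimisation (peak)", 36.0)):
        capacity_gb = max_delta_mb(rate, 15) / 1000.0
        print(f"{label}: up to {capacity_gb:.1f} GB of changes per 15-minute cycle")
```

Under these assumptions, a 15-minute cycle can carry roughly 6.2 GB of changes at the original rate and roughly 32.4 GB at the optimised peak rate. Any cycle whose changed data exceeds that budget will fall behind, which is how the lag builds up.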
An exact mirror of the 4 Dell R Series servers and SAS/SATA EqualLogic arrays was implemented at DR, with a 15-minute block-level replication schedule for volumes.
ESXi 5.0 has been installed in conjunction with vSphere 5.0 and SRM 5.0 at both Production and DR.
SRM is a vCenter software application/plugin that uses what’s called a Storage Adapter (a Storage Replication Adapter, or SRA, in VMware's terminology).
A Storage Adapter hooks natively into the storage arrays at both sites and issues commands to manipulate storage volumes.
One of Site Recovery Manager’s greatest attributes is the ability to test the recovery plan before there is a failure. This lets administrators tune their recovery process and make sure the plan is sound in the case of an actual failover. It also allows comprehensive auditing of the recovery plans without impacting running virtual machines.
A test failover is designed to avoid any impact on the production VMs and datastore volumes. When a test is run on the recovery site, the recovery group is told to create a clone of the replica volumes and bring the clones online. SRM sends the IQN information from the ESXi hosts to the SAN group to add to the access control list. Because the test uses a clone of the replica, production replication is never impacted. However, there needs to be extra free space in the group to create the clone and bring it online. If there is not enough free space, the clone operation will fail and the test will fail with an error about not being able to see the datastore volumes.
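Because a shortage of free space is the failure mode called out here, a pre-flight check before a test run can be useful. The Python sketch below is illustrative only. It does not call any EqualLogic or SRM API, the volume names and sizes are invented, and it makes the conservative assumption that each clone may need up to the full size of its replica volume.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ReplicaVolume:
    """A replica volume at the recovery site (hypothetical record, not an array API object)."""
    name: str
    size_gb: float


def clone_space_shortfall_gb(free_gb: float, replicas: Iterable[ReplicaVolume]) -> float:
    """Return how many GB are missing to clone every replica, or 0.0 if space suffices.

    Assumption: worst case, each clone consumes the full size of its replica.
    """
    needed_gb = sum(replica.size_gb for replica in replicas)
    return max(0.0, needed_gb - free_gb)


if __name__ == "__main__":
    # Invented example volumes; real sizes would come from the array's own reporting.
    replicas = [ReplicaVolume("vmfs-prod-01", 500.0), ReplicaVolume("vmfs-prod-02", 750.0)]
    shortfall = clone_space_shortfall_gb(free_gb=1000.0, replicas=replicas)
    if shortfall > 0:
        print(f"Test failover would likely fail: short by {shortfall:.0f} GB")
    else:
        print("Enough free space to clone all replica volumes")
```

The actual space a clone consumes depends on the array and its provisioning settings, so treat the result as an upper bound rather than an exact requirement.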
Once the clone is created and brought online, SRM will rescan the storage at the ESXi host level.
It will resignature the volume as needed to be able to bring that clone online as a new Datastore.
SRM will then re-register the Recovery VMs to point to the new clone volume and start powering them up in the order specified during the creation of the Protection group and Recovery plan. During the testing phase SRM will isolate the VMs based on the testing network setting that was configured earlier. By default it will create a Test-Bubble virtual switch with no external NIC connections. Again, this is done along with everything else to protect the production environment.
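The ordered power-up can be pictured as batches of VMs grouped by a start-up priority. The sketch below is a plain Python illustration of that idea, not SRM's own mechanism or API. The VM names and priority numbers are hypothetical, and it assumes a lower number means an earlier start.

```python
from itertools import groupby
from typing import List, Tuple


def power_on_batches(vms: List[Tuple[str, int]]) -> List[List[str]]:
    """Group (vm_name, priority) pairs into start-up batches, lowest priority number first.

    VMs within a batch are sorted by name so the output is deterministic.
    """
    ordered = sorted(vms, key=lambda vm: (vm[1], vm[0]))
    return [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda vm: vm[1])
    ]


if __name__ == "__main__":
    # Hypothetical inventory: infrastructure first, then applications, then front ends.
    inventory = [("app01", 2), ("dc01", 1), ("sql01", 1), ("web01", 3)]
    for number, batch in enumerate(power_on_batches(inventory), start=1):
        print(f"Batch {number}: {', '.join(batch)}")
```

Starting dependencies such as directory and database servers in the first batch means later batches come up with the services they rely on already running, in the test bubble just as in a real failover.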
By using Site Recovery Manager, we have implemented a solution that captures our Production Site with a 15-minute delay and can bring it up either in a test bubble or in a real-life disaster recovery scenario.
The use of VRRP and BGP as a mechanism to bridge the server VLANs across the two sites has made the transition to the live site completely seamless to the remote sites, and hence to the clients.
If anyone has thoughts on these types of solutions, and on how they will change as they move to the cloud, comments are welcome.