Disaster Recovery Planning

Overview

When a failure happens, it is not just the data that needs restoring but the full working environment. That is the scope of disaster recovery.
Backups are one component of a disaster recovery plan, not the whole plan. The overall goal of disaster recovery is to get systems, including their associated data, restored and running as quickly as possible.
The increasing use of virtualization has changed how disaster recovery is carried out because, in a virtual environment, a system can be recovered by duplicating virtual machine (VM) images and recreating them elsewhere. In a VM environment, snapshots now form an integral part of any disaster recovery plan.
Part of the planning is determining your RPO (Recovery Point Objective) tolerance: how much data or recent configuration change, if any, you are prepared to lose.
The amount of resources you put into the disaster recovery program will depend on your RTO (Recovery Time Objective): how quickly systems must be back in service.

Snapshots

Server configuration changes are usually planned, are often under change management procedures, and their timing can mostly be controlled. Snapshots are good for:
Scheduled (ad hoc):

Times when an application is being deployed and there is a possibility of development work damaging the installed infrastructure.
Demos (Dev or Test environment): The data needs to be changed and tested with simulated "junk" data during a demo, then returned to the original version immediately after the demo, upgrade, or update is completed. This applies when waiting for a regularly scheduled refresh of the environment is not acceptable.
Times of a scheduled infrastructure change (operating system upgrades/updates, hardware upgrades, etc.).

Scheduled (ongoing basis)

Disaster recovery of the server build/infrastructure: once a day or slightly less frequently is likely sufficient, since the environment should not change often outside the scheduled change periods above.

The retention period of snapshots should also be considered. The following is an article from VMware on the use of snapshots:

Purpose

This article provides best practice information on using virtual machine snapshots.

Resolution

Follow these best practices when using snapshots in the vSphere environment:

Do not use snapshots as backups. The snapshot file is only a change log of the original virtual disk, it creates a place holder disk, virtual_machine-00000x-delta.vmdk, to store data changes since the time the snapshot was created. If the base disks are deleted, the snapshot files are not sufficient to restore a virtual machine.
VMware recommends only a maximum of 32 snapshots in a chain. However, for a better performance, use only 2 to 3 snapshots.
Do not use a single snapshot for more than 24-72 hours. The snapshot file continues to grow in size when it is retained for a longer period. This can cause the snapshot storage location to run out of space and impact the system performance.

For long-term snapshots of servers, it is better to copy the entire VMware server, document it (name, purpose, date), and save it to a storage location.
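The limits quoted above (at most 32 snapshots in a chain, preferably 2 to 3, and no snapshot kept beyond 24-72 hours) can be checked mechanically. The following is a minimal sketch, not a vSphere integration: the `Snapshot` record, the `check_snapshots` function, and the sample dates are all hypothetical, and the snapshot inventory is assumed to have been collected by some other means.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Limits taken from the guidance quoted above; the constant names are illustrative.
MAX_CHAIN_LENGTH = 32                    # stated maximum snapshots in a chain
PREFERRED_CHAIN_LENGTH = 3               # "use only 2 to 3 snapshots"
MAX_SNAPSHOT_AGE = timedelta(hours=72)   # upper end of the 24-72 hour window


@dataclass
class Snapshot:
    """Hypothetical inventory record, not a vSphere API object."""
    vm_name: str
    created: datetime


def check_snapshots(snapshots, now):
    """Return warning strings for one VM's snapshot chain."""
    warnings = []
    if len(snapshots) > MAX_CHAIN_LENGTH:
        warnings.append(f"chain has {len(snapshots)} snapshots, "
                        f"above the maximum of {MAX_CHAIN_LENGTH}")
    elif len(snapshots) > PREFERRED_CHAIN_LENGTH:
        warnings.append(f"chain has {len(snapshots)} snapshots, "
                        f"more than the preferred {PREFERRED_CHAIN_LENGTH}")
    for snap in snapshots:
        age = now - snap.created
        if age > MAX_SNAPSHOT_AGE:
            hours = int(age.total_seconds() // 3600)
            warnings.append(f"{snap.vm_name}: snapshot is {hours} hours old")
    return warnings


# Example with made-up dates: one snapshot is five days old, one is a day old.
now = datetime(2024, 1, 10, 8, 0)
chain = [
    Snapshot("SQL Server", datetime(2024, 1, 5, 8, 0)),
    Snapshot("SQL Server", datetime(2024, 1, 9, 8, 0)),
]
for warning in check_snapshots(chain, now):
    print(warning)
```

A check like this could run alongside the daily snapshot job so that stale snapshots are flagged before they consume the snapshot storage location.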

Backups

Database changes likely occur every second in an MES environment. This is the area where the risk of data loss is greatest and where RPO tolerance matters most. Snapshots are not an effective means of recovering from outages that involve rapidly changing transactional data unless they are taken constantly at short intervals. That is often impractical because of the additional load on the production server and the storage limitations at sites that would have to manage and store all the snapshots.
A traditional backup-and-restore approach is better suited to databases with a lot of real-time transactional data. It can be accomplished through planned use of Full, Differential, and Transaction Log backups. One of the main considerations in choosing among these backup types is the RTO (Recovery Time Objective). A strategy using only a Full backup and Transaction Log backups will have a longer RTO than one using a Full backup followed by Differential backups and Transaction Log backups, because more log backups have to be restored.

Example of Disaster Recovery Plan

VMware Copies:

A full copy of the VMware server.

Servers	Schedule	Retention
Application	Once every two weeks	One month
Visualization	Once every two weeks	One month
Process Historian	Once every two weeks	One month
SQL Server	Once every two weeks	One month
OPC	Once a month	One month

Snapshots

Servers	Schedule	Retention
Application	Every day	2 days
Visualization	Every day	2 days
Process Historian	Every day	2 days
SQL Server	Every day	2 days
OPC	Every day	2 days

Backups

SQL Databases

Server	Backup Type	Schedule	Retention
SQL Server
	Full	Every day	2 days
	Differential	Every 12 hours	2 days
	Transaction Log	Every 15 minutes	2 days
Note: If re-indexing is performed, it is best to schedule it just before a full backup; performed after a full backup, it will produce large differential backups.

This suggests an RPO tolerance of 15 minutes. In theory, after a disaster, you could recover your data to within 15 minutes of the disaster. The RTO is hard to estimate because it depends on the amount of data in the transaction log backups. If a shorter RTO is critical, more differentials should be taken during the day.
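The trade-off between differential frequency and restore effort can be made concrete with a little arithmetic. The sketch below assumes evenly spaced backups and gives only an upper bound on the number of transaction log backups that would have to be restored after the most recent full or differential backup; the function name `upper_bound_log_restores` is illustrative, and the actual restore time also depends on how much data each log backup contains.

```python
from datetime import timedelta


def upper_bound_log_restores(base_interval, log_interval):
    """Upper bound on log backups to restore after the most recent
    full or differential backup, assuming evenly spaced backups."""
    # Dividing one timedelta by another with // yields an integer.
    return base_interval // log_interval


LOG_INTERVAL = timedelta(minutes=15)  # also the approximate RPO in this plan

# Spacing between consecutive full or differential backups, in hours.
for hours in (24, 12, 6, 4):
    count = upper_bound_log_restores(timedelta(hours=hours), LOG_INTERVAL)
    print(f"full or differential every {hours:>2} h: at most {count} log restores")
```

With the schedule in the table (a full or differential backup every 12 hours), the bound is 48 log restores; taking a differential every 6 hours halves it.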

Cleanup

An example of disaster recovery, based on the backup schedule above:

#	Full Backup	Differentials	Transaction logs	Disaster
	12:00 AM
1			12:15 AM
2			12:30 AM
3			12:45 AM
4			1:00 AM
5			1:15 AM
6			1:30 AM
7			1:45 AM
8			2:00 AM
9			2:15 AM
10			2:30 AM
11			2:45 AM
12			3:00 AM
13			3:15 AM
14			3:30 AM
15			3:45 AM
16			4:00 AM
17			4:15 AM
18			4:30 AM
19			4:45 AM
20			5:00 AM
21			5:15 AM
22			5:30 AM
23			5:45 AM
24			6:00 AM
25			6:15 AM
26			6:30 AM
27			6:45 AM
28			7:00 AM
29			7:15 AM
				Disaster 1 (7:25 AM)
30			7:30 AM
31			7:45 AM
32			8:00 AM
33			8:15 AM
34			8:30 AM
35			8:45 AM
36			9:00 AM
37			9:15 AM
38			9:30 AM
39			9:45 AM
40			10:00 AM
41			10:15 AM
42			10:30 AM
43			10:45 AM
44			11:00 AM
45			11:15 AM
46			11:30 AM
47			11:45 AM
48		12:00 PM	12:00 PM
49			12:15 PM
50			12:30 PM
51			12:45 PM
52			1:00 PM
				Disaster 2 (1:05 PM)
53			1:15 PM
54			1:30 PM
55			1:45 PM

With Disaster #1 (at 7:25 AM) the following recovery procedures would occur:

Restore the full backup taken at 12:00 AM.
Restore each transaction log in ascending order, from log #1 to #29 (29 distinct restore commands).

10 minutes of data would be lost, from 7:15 AM to 7:25 AM.

With Disaster #2 (at 1:05 PM) the following recovery procedures would occur:

Restore the full backup taken at 12:00 AM.
Restore the differential backup taken at 12:00 PM.
Restore each transaction log in ascending order, from log #49 to #52 (4 distinct restore commands).

5 minutes of data would be lost, from 1:00 PM to 1:05 PM.
If the differential backup had not been taken, the full backup would have to be restored, followed by 52 transaction log backups.
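The two scenarios can be reproduced with a short calculation. This is a sketch under the schedule shown in the table (full backup at 12:00 AM, differential at 12:00 PM, log backup #n taken n × 15 minutes after midnight); the date and the `restore_plan` function are illustrative and not part of any SQL Server tooling.

```python
from datetime import datetime, timedelta

DAY = datetime(2024, 1, 1)               # arbitrary date; only times of day matter
LOG_INTERVAL = timedelta(minutes=15)

FULL = DAY                                # 12:00 AM full backup
DIFFERENTIALS = [DAY + timedelta(hours=12)]   # 12:00 PM differential
# Log backup #n is taken n * 15 minutes after midnight, as in the table.
LOGS = {n: DAY + n * LOG_INTERVAL for n in range(1, 96)}


def restore_plan(disaster, differentials=DIFFERENTIALS):
    """Return (differential to restore or None, log numbers, data lost)."""
    usable = [d for d in differentials if FULL <= d <= disaster]
    diff = max(usable) if usable else None
    base = diff if diff is not None else FULL
    # Only logs taken after the base backup and before the disaster are needed.
    logs = [n for n, taken in LOGS.items() if base < taken <= disaster]
    last_point = LOGS[logs[-1]] if logs else base
    return diff, logs, disaster - last_point


for label, when in [("Disaster 1", DAY + timedelta(hours=7, minutes=25)),
                    ("Disaster 2", DAY + timedelta(hours=13, minutes=5))]:
    diff, logs, lost = restore_plan(when)
    print(label, "| differential:", diff.strftime("%I:%M %p") if diff else "none",
          "| logs:", f"#{logs[0]} to #{logs[-1]} ({len(logs)} restores)",
          "| lost:", lost)
```

Calling `restore_plan` for Disaster 2 with `differentials=[]` returns logs #1 to #52, which matches the 52 restores needed when no differential is available.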
Notes:

In the scenarios above, the time it takes to restore each transaction log is the main factor in the RTO calculation.
The full backup, differential backup, and transaction logs have to come from the same "set": a differential can only be applied on top of the full backup it was based on, and the transaction logs must form an unbroken sequence. A previous full backup would not work with a later differential.
Regarding data loss and MES data specifically, most transactions that result from process historian data can be "re-triggered" and reprocessed very easily up to the point in time of the disaster. The transactions at risk of being missed are those that were manually entered by operators or downloaded from third-party applications via interfaces.

SQL Server Agent Database Backup

SQL Server Agent jobs can be set up to perform database and transaction log backups.
For example, you might schedule three jobs: a Full backup once per day, a Differential backup every 12 hours, and a Transaction Log backup every 15 minutes or half-hour. Database and transaction log backups would then be sent to a separate storage device such as a NAS server.
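As an illustration of what each of the three jobs might run, the sketch below builds the corresponding T-SQL backup statements as strings. The database name `MESDB`, the share `\\nas01\sqlbackups`, and the file naming pattern are assumptions made for the example, and the helper `backup_statement` is hypothetical; the statements themselves use the standard `BACKUP DATABASE`, `WITH DIFFERENTIAL`, and `BACKUP LOG` forms.

```python
from datetime import datetime


def backup_statement(database, kind, backup_dir, when):
    """Build one T-SQL backup statement; kind is 'full', 'differential', or 'log'."""
    stamp = when.strftime("%Y%m%d_%H%M")   # timestamp keeps file names unique
    if kind == "full":
        return (f"BACKUP DATABASE [{database}] "
                f"TO DISK = N'{backup_dir}\\{database}_full_{stamp}.bak';")
    if kind == "differential":
        return (f"BACKUP DATABASE [{database}] "
                f"TO DISK = N'{backup_dir}\\{database}_diff_{stamp}.bak' "
                f"WITH DIFFERENTIAL;")
    if kind == "log":
        # Log backups require the FULL or BULK_LOGGED recovery model.
        return (f"BACKUP LOG [{database}] "
                f"TO DISK = N'{backup_dir}\\{database}_log_{stamp}.trn';")
    raise ValueError(f"unknown backup kind: {kind}")


# Hypothetical database name and NAS share, for illustration only.
for kind in ("full", "differential", "log"):
    print(backup_statement("MESDB", kind, r"\\nas01\sqlbackups",
                           datetime(2024, 1, 1, 0, 0)))
```

Each generated statement could serve as the step of one SQL Server Agent job, with the job schedule supplying the frequency.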

Historian Archives and OPC Project Configuration Files

Server	Backup Type	Schedule	Retention
Historian (Off Site Backup)
	Full (all archives)	Once	Permanent
	Latest 3 archives	Daily	Keep one copy of each; purge duplicates
OPC Servers (Off Site Backup)
	Last "project" file (.opf)	Weekly or on ad hoc change	Keep the two most recent
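The daily "latest 3 archives" rule can be sketched as a small copy job. This is one possible interpretation, not the historian vendor's tooling: it assumes the archives are ordinary files in a single directory, uses file modification time to decide which are latest, and treats "keep one copy of each; purge duplicates" as overwriting any earlier copy with the same file name. The function name and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def copy_latest_archives(archive_dir, offsite_dir, count=3):
    """Copy the `count` most recently modified files to the off-site folder.

    A copy with the same name replaces the earlier one, so the off-site
    folder never holds more than one copy of each archive.
    """
    archive_dir, offsite_dir = Path(archive_dir), Path(offsite_dir)
    offsite_dir.mkdir(parents=True, exist_ok=True)
    archives = sorted(
        (p for p in archive_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,                      # newest first
    )
    copied = []
    for src in archives[:count]:
        shutil.copy2(src, offsite_dir / src.name)   # copy2 preserves timestamps
        copied.append(src.name)
    return copied
```

Running the function once a day against the historian archive folder and the off-site location would follow the schedule in the table; the one-time full copy of all archives would still be made separately.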