Overview
When a failure happens, it is not just the data that needs restoring but the full working environment. That is the scope of disaster recovery.
Backups are one component of a disaster recovery plan, not the whole of it. The overall goal of disaster recovery is to get systems, including their associated data, restored and running as quickly as possible.
The increasing use of virtualization has changed the way disaster recovery is carried out because, in a virtual world, a system can be recovered by duplicating images of virtual machines (VMs) and recreating them elsewhere. In a VM environment, snapshots now form an integral part of any disaster recovery plan.
Part of the planning is determining your RPO (Recovery Point Objective) tolerance: how much, if any, data or recent configuration changes you are prepared to lose.
The amount of resources put into the disaster recovery program will depend on your RTO (Recovery Time Objective): how quickly the systems must be back in service.
Snapshots
Server configuration changes are usually planned and often fall under change management procedures, so their timing can mostly be controlled. Snapshots are good for:
Scheduled (ad hoc):
Times when an application is being deployed and there is a possibility of development work damaging the installed infrastructure.
Demos (Dev or Test environment): the data needs to be changed and tested with simulated "junk" data during a demo, then returned to its original version immediately after the demo, upgrade, or update is completed. This applies when waiting for a regularly scheduled refresh of the environment is not acceptable.
Times of a scheduled infrastructure change (operating system upgrades/updates, hardware upgrades, etc.).
Scheduled (ongoing basis):
Disaster recovery of the server build/infrastructure: once a day, or slightly less frequently, would likely be sufficient since the environment should not change often outside of the scheduled change periods above.
The retention period of snapshots should also be considered. The following is an article from VMware on the use of snapshots:
Purpose
This article provides best practice information on using virtual machine snapshots.
Resolution
Follow these best practices when using snapshots in the vSphere environment:
Do not use snapshots as backups. The snapshot file is only a change log of the original virtual disk; it creates a placeholder disk, virtual_machine-00000x-delta.vmdk, to store data changes since the time the snapshot was created. If the base disks are deleted, the snapshot files are not sufficient to restore a virtual machine.
VMware recommends a maximum of 32 snapshots in a chain. However, for better performance, use only 2 to 3 snapshots.
Do not use a single snapshot for more than 24-72 hours. The snapshot file continues to grow in size when it is retained for a longer period. This can cause the snapshot storage location to run out of space and impact the system performance.
For long-term snapshots of servers, it is better to copy the entire VMware server, document the copy (name, purpose, date), and save it to a storage location.
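The age and chain-length guidance above lends itself to a routine check. The following is a minimal sketch in Python, not a VMware tool: the `Snapshot` record and its fields are hypothetical, and the inventory is assumed to have been exported from the virtualization platform by some other means. Only the 72-hour and 3-snapshot thresholds come from the guidance above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds taken from the guidance above; adjust to local policy.
MAX_SNAPSHOT_AGE = timedelta(hours=72)
MAX_CHAIN_LENGTH = 3


@dataclass
class Snapshot:
    # Hypothetical inventory record; field names are assumptions.
    vm_name: str
    name: str
    created: datetime


def snapshot_warnings(snapshots, now):
    """Return warnings for snapshots that are too old and chains that are too long."""
    warnings = []
    per_vm = {}
    for snap in snapshots:
        per_vm.setdefault(snap.vm_name, []).append(snap)
        age = now - snap.created
        if age > MAX_SNAPSHOT_AGE:
            hours = int(age.total_seconds() // 3600)
            warnings.append(f"{snap.vm_name}/{snap.name}: {hours} h old, exceeds 72 h")
    for vm_name, snaps in per_vm.items():
        if len(snaps) > MAX_CHAIN_LENGTH:
            warnings.append(
                f"{vm_name}: {len(snaps)} snapshots, more than {MAX_CHAIN_LENGTH}"
            )
    return warnings


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)  # arbitrary date, for illustration only
    inventory = [
        Snapshot("SQL Server", "pre-upgrade", datetime(2024, 1, 5, 12, 0)),
        Snapshot("OPC", "daily", datetime(2024, 1, 10, 0, 0)),
    ]
    for line in snapshot_warnings(inventory, now):
        print(line)
```

With the sample inventory, only the five-day-old "pre-upgrade" snapshot is reported; the daily snapshot is within the limit.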
Backups
Database changes likely occur every second in an MES environment. This is the area where the risk of data loss is greatest and where RPO tolerance matters most. Snapshots are not an effective means of recovering from outages that involve rapidly changing transactional data unless they are taken constantly at short intervals. That is often not practical because of the additional load on the production server and the storage limitations at sites that would have to manage and store all the snapshots.
The traditional backup-and-restore approach is better suited to databases with large volumes of real-time transactional data. It relies on a planned combination of Full, Differential, and Transaction Log backups. One of the main considerations when choosing that combination is the RTO (Recovery Time Objective). A strategy that uses only a Full backup and Transaction Log backups will have a longer RTO than a strategy that uses a Full backup followed by Differential backups and Transaction Log backups.
Example of Disaster Recovery Plan
VMware Copies
A full copy of the VMware server.
Servers | Schedule | Retention |
Application | Once every two weeks | One month |
Visualization | Once every two weeks | One month |
Process Historian | Once every two weeks | One month |
SQL Server | Once every two weeks | One month |
OPC | Once a month | One month |
Snapshots
Servers | Schedule | Retention |
Application | Every day | 2 days |
Visualization | Every day | 2 days |
Process Historian | Every day | 2 days |
SQL Server | Every day | 2 days |
OPC | Every day | 2 days |
Backups
SQL Databases
Server | Backup Type | Schedule | Retention |
SQL Server | Full | Every day | 2 days |
SQL Server | Differential | Every 12 hours | 2 days |
SQL Server | Transaction Log | Every 15 minutes | 2 days |
Note: If re-indexing is performed, it's best to schedule it just before a full backup; performed after a full backup, it makes the following differential backups large.
This schedule implies an RPO tolerance of 15 minutes: in theory, after a disaster, you could recover your data to within 15 minutes of the failure. The RTO is harder to estimate because it depends on the amount of data in the transaction log backups that must be replayed. If a shorter RTO is critical, take differential backups more often during the day.
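The trade-off can be put into rough numbers. The sketch below is an approximation under stated assumptions: every backup completes on schedule, the most recent differential (or the full backup) is always the restore base, and the function name `worst_case` is made up for this illustration. It gives an upper bound on replayed logs, not a measured RTO.

```python
from datetime import timedelta


def worst_case(log_interval, differential_interval):
    """Return (maximum data loss, upper bound on transaction logs to replay)."""
    # At worst, the disaster strikes just before the next log backup.
    max_data_loss = log_interval
    # At worst, a full differential interval of logs has to be replayed.
    max_log_restores = differential_interval // log_interval
    return max_data_loss, max_log_restores


if __name__ == "__main__":
    for hours in (12, 6, 3):
        loss, logs = worst_case(timedelta(minutes=15), timedelta(hours=hours))
        print(f"differential every {hours} h: lose up to {loss}, replay up to {logs} logs")
```

Halving the differential interval halves the number of logs that may need replaying, while the worst-case data loss stays at the log interval.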
Cleanup
An example of disaster recovery.
# | Full Backup | Differentials | Transaction logs | Disaster |
| 12:00 AM | | | |
1 | | | 12:15 AM | |
2 | | | 12:30 AM | |
3 | | | 12:45 AM | |
4 | | | 1:00 AM | |
5 | | | 1:15 AM | |
6 | | | 1:30 AM | |
7 | | | 1:45 AM | |
8 | | | 2:00 AM | |
9 | | | 2:15 AM | |
10 | | | 2:30 AM | |
11 | | | 2:45 AM | |
12 | | | 3:00 AM | |
13 | | | 3:15 AM | |
14 | | | 3:30 AM | |
15 | | | 3:45 AM | |
16 | | | 4:00 AM | |
17 | | | 4:15 AM | |
18 | | | 4:30 AM | |
19 | | | 4:45 AM | |
20 | | | 5:00 AM | |
21 | | | 5:15 AM | |
22 | | | 5:30 AM | |
23 | | | 5:45 AM | |
24 | | | 6:00 AM | |
25 | | | 6:15 AM | |
26 | | | 6:30 AM | |
27 | | | 6:45 AM | |
28 | | | 7:00 AM | |
29 | | | 7:15 AM | |
| | | | 7:25 AM (Disaster 1) |
30 | | | 7:30 AM | |
31 | | | 7:45 AM | |
32 | | | 8:00 AM | |
33 | | | 8:15 AM | |
34 | | | 8:30 AM | |
35 | | | 8:45 AM | |
36 | | | 9:00 AM | |
37 | | | 9:15 AM | |
38 | | | 9:30 AM | |
39 | | | 9:45 AM | |
40 | | | 10:00 AM | |
41 | | | 10:15 AM | |
42 | | | 10:30 AM | |
43 | | | 10:45 AM | |
44 | | | 11:00 AM | |
45 | | | 11:15 AM | |
46 | | | 11:30 AM | |
47 | | | 11:45 AM | |
48 | | 12:00 PM | 12:00 PM | |
49 | | | 12:15 PM | |
50 | | | 12:30 PM | |
51 | | | 12:45 PM | |
52 | | | 1:00 PM | |
| | | | 1:05 PM (Disaster 2) |
53 | | | 1:15 PM | |
54 | | | 1:30 PM | |
55 | | | 1:45 PM | |
With Disaster #1 (at 7:25 AM) the following recovery procedures would occur:
Restore the full backup taken at 12:00 AM.
Restore each transaction log in ascending order, from log #1 to log #29 (29 distinct restore commands).
The 10 minutes of data from 7:15 AM to 7:25 AM would be lost.
With Disaster #2 (at 1:05 PM) the following recovery procedures would occur:
Restore the full backup taken at 12:00 AM.
Restore the differential backup taken at 12:00 PM.
Restore each transaction log in ascending order, from log #49 to log #52 (4 distinct restore commands).
The 5 minutes of data from 1:00 PM to 1:05 PM would be lost.
If the differential backup had not been taken, the full backup followed by all 52 transaction log backups would have had to be restored.
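The two scenarios can be reproduced with a short calculation. The sketch below, in Python, is illustrative only: the function `restore_plan` and the calendar date are made up for the example, and it assumes every scheduled backup completed successfully and is usable.

```python
from datetime import datetime, timedelta


def restore_plan(disaster, full, differentials, logs):
    """Return (steps, data_loss) for restoring up to the last log before `disaster`.

    `full` is the time of the full backup; `differentials` and `logs` are lists
    of backup times. Each step is a (kind, time) tuple in restore order.
    """
    steps = [("full", full)]
    base = full
    # Only the most recent differential taken before the disaster is needed.
    usable_diffs = [d for d in differentials if full < d <= disaster]
    if usable_diffs:
        base = max(usable_diffs)
        steps.append(("differential", base))
    # Replay, in order, every log taken after the base and before the disaster.
    usable_logs = sorted(t for t in logs if base < t <= disaster)
    steps.extend(("log", t) for t in usable_logs)
    last_restored = usable_logs[-1] if usable_logs else base
    return steps, disaster - last_restored


if __name__ == "__main__":
    day = datetime(2024, 1, 1)  # arbitrary date, for illustration only
    differentials = [day + timedelta(hours=12)]
    logs = [day + timedelta(minutes=15 * n) for n in range(1, 56)]  # logs 1 to 55

    scenarios = (
        ("Disaster 1", day + timedelta(hours=7, minutes=25)),
        ("Disaster 2", day + timedelta(hours=13, minutes=5)),
    )
    for label, when in scenarios:
        steps, loss = restore_plan(when, day, differentials, logs)
        log_count = sum(1 for kind, _ in steps if kind == "log")
        print(f"{label}: {len(steps)} restore steps, {log_count} logs, {loss} lost")
```

For the schedule in the table this yields 29 log restores and 10 minutes lost for Disaster 1, and 4 log restores and 5 minutes lost for Disaster 2. Passing an empty list of differentials for Disaster 2 gives the 52 log restores mentioned above.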
Notes:
In the scenarios above the time that it takes to process the recovery of each transaction log is the main factor in the RTO calculation.
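The full backup, differential backup, and transaction logs have to be from the same "set": a differential can only be applied on top of the full backup it was based on, and the transaction logs must form an unbroken sequence. A differential cannot be restored on top of a previous full backup.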
Regarding data loss and MES data specifically, most transactions that result from process historian data can be "re-triggered" and reprocessed very easily up to the point in time of the disaster. The transactions at risk of being lost are those that were manually entered by operators or downloaded from third-party applications via interfaces.
SQL AGENT Database Backup
SQL Server Agent jobs can be set up to perform database and transaction log backups.
For example, you might schedule three jobs: a Full backup once per day, a Differential backup every 12 hours, and a Transaction Log backup every 15 minutes or half-hour. The database and transaction log backups would then be sent to a separate storage device such as a NAS server.
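As an illustration of what each of the three jobs would run, the Python sketch below builds the corresponding T-SQL `BACKUP` statements as text. It does not connect to SQL Server or create Agent jobs. The database name `MES`, the NAS share `\\nas01\sqlbackups`, and the file naming scheme are assumptions made for the example, and the names are assumed to come from trusted configuration rather than user input.

```python
from datetime import datetime


def backup_statement(database, kind, backup_dir, now):
    """Build a T-SQL backup statement; kind is 'full', 'differential', or 'log'."""
    stamp = now.strftime("%Y%m%d_%H%M")
    if kind == "full":
        target = f"{backup_dir}\\{database}_full_{stamp}.bak"
        return f"BACKUP DATABASE [{database}] TO DISK = N'{target}';"
    if kind == "differential":
        target = f"{backup_dir}\\{database}_diff_{stamp}.bak"
        return f"BACKUP DATABASE [{database}] TO DISK = N'{target}' WITH DIFFERENTIAL;"
    if kind == "log":
        target = f"{backup_dir}\\{database}_log_{stamp}.trn"
        return f"BACKUP LOG [{database}] TO DISK = N'{target}';"
    raise ValueError(f"unknown backup kind: {kind}")


if __name__ == "__main__":
    nas_share = r"\\nas01\sqlbackups"   # hypothetical NAS location
    when = datetime(2024, 1, 1, 0, 15)  # arbitrary timestamp, for illustration only
    for kind in ("full", "differential", "log"):
        print(backup_statement("MES", kind, nas_share, when))
```

Each generated statement would be the command text of one job step, with the schedule itself (daily, every 12 hours, every 15 minutes) configured on the job.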
Historian Archives and OPC Project Configuration Files
Server | Backup Type | Schedule | Retention |
Historian (Off Site Backup) | Full (all archives) | Once | Permanent |
Historian (Off Site Backup) | Latest 3 archives | Daily | Keep one copy of each |
OPC Servers (Off Site Backup) | Last "project" file (.opf) | Weekly or on ad-hoc change. | Keep last two current ones |