...

When a failure happens, it is not just data that needs restoring but the full working environment; this is the scope of disaster recovery.
Backups are one component of a disaster recovery plan, not the whole of it. The overall goal of disaster recovery is to get systems, including their associated data, restored and running as quickly as possible.
The increasing use of virtualization has changed the way disaster recovery is carried out because, in a virtual environment, a system can be recovered by duplicating virtual machine (VM) images and recreating them elsewhere. In a VM environment, snapshots now form an integral part of any disaster recovery plan.
Part of the planning is determining your RPO (Recovery Point Objective) tolerance: how much, if any, data or recent configuration changes you are prepared to lose.
The amount of resources you put into the disaster recovery program will depend on your RTO (Recovery Time Objective): how quickly systems must be back in service.

Snapshots

Server configuration changes are usually planned, are often under change management procedures, and their timing can mostly be controlled. Snapshots are good for:
Scheduled (ad hoc):

...

The retention period of snapshots should also be considered. The following is an article from VMware on the use of snapshots:

Purpose

This article provides best practice information on using virtual machine snapshots.

Resolution

Follow these best practices when using snapshots in the vSphere environment:

...

For long-term snapshots of servers, it is better to copy the entire VMware server, document it (name, purpose, date), and save it to a storage location.

Disaster Recovery Plan

Database changes likely occur every second in an MES environment. This is where the risk of data loss is greatest and where RPO tolerance matters most. Snapshots are not an effective means of recovering from outages that involve rapidly changing transactional data unless they are taken constantly at short intervals. This is often not practical because of the additional load on the production server and the storage limitations at sites that must manage and store all the snapshots.
A traditional backup-and-restore approach is better suited to databases with large amounts of real-time transactional data. It can be accomplished through planned use of Full, Differential, and Transaction Log backups. One of the main considerations in choosing among these strategies is the RTO (Recovery Time Objective). A strategy using only a Full backup and Transaction Log backups will have a longer RTO than one using a Full backup followed by Differential backups and Transaction Log backups.

VMware Copies

A full copy of the VMware server.

| Servers | Schedule | Retention |
|---|---|---|
| Application | Once every two weeks | One month |
| Visualization | Once every two weeks | One month |
| Process Historian | Once every two weeks | One month |
| SQL Server | Once every two weeks | One month |
| OPC | Once a month | One month |

Snapshots

| Servers | Schedule | Retention |
|---|---|---|
| Application | Every day | 2 days |
| Visualization | Every day | 2 days |
| Process Historian | Every day | 2 days |
| SQL Server | Every day | 2 days |
| OPC | Every day | 2 days |
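
The retention column can be applied mechanically when cleaning up. The following is a minimal sketch, assuming snapshots are tracked as a mapping of name to creation time; the function name `expired_snapshots` and the sample snapshot names are hypothetical and do not come from any VMware tool.

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshots: dict[str, datetime],
                      now: datetime,
                      retention: timedelta) -> list[str]:
    """Return the names of snapshots older than the retention period."""
    return sorted(name for name, taken in snapshots.items()
                  if now - taken > retention)


# Hypothetical daily snapshots of the Application server.
snapshots = {
    "app-2024-01-07": datetime(2024, 1, 7, 2, 0),
    "app-2024-01-08": datetime(2024, 1, 8, 2, 0),
    "app-2024-01-09": datetime(2024, 1, 9, 2, 0),
    "app-2024-01-10": datetime(2024, 1, 10, 2, 0),
}

# Retention of 2 days, as in the table above.
to_delete = expired_snapshots(snapshots,
                              now=datetime(2024, 1, 10, 6, 0),
                              retention=timedelta(days=2))
print(to_delete)
```

The sketch only reports which snapshots have aged out; the actual deletion would be done with whatever tooling the site uses.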

Backups

SQL Databases

| Server | Backup Type | Schedule | Retention |
|---|---|---|---|
| SQL Server | Full | Every day | 2 days |
| SQL Server | Differential | Every 12 hours | 2 days |
| SQL Server | Transaction Log | Every 15 minutes | 2 days |

Note: If re-indexing is performed, it is best to schedule it just before a full backup; re-indexing after a full backup produces large differential backups.

This schedule implies an RPO tolerance of 15 minutes: in theory, after a disaster, you could recover your data to within 15 minutes of the disaster. The RTO is harder to estimate because it depends on the amount of data in the transaction log backups that must be restored. If a shorter RTO is critical, take Differential backups more often during the day.
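
The arithmetic behind these figures can be sketched as follows. This is an illustration only; the function names are hypothetical, and the second function counts the log backups taken between two consecutive restore points (full or differential), which is an upper bound on how many may need replaying.

```python
from datetime import timedelta


def worst_case_data_loss(log_interval: timedelta) -> timedelta:
    """Everything written since the last log backup can be lost."""
    return log_interval


def logs_per_restore_point(restore_point_interval: timedelta,
                           log_interval: timedelta) -> int:
    """Number of log backups taken between two full/differential backups."""
    return restore_point_interval // log_interval


log_interval = timedelta(minutes=15)

# Schedule from the table: differential every 12 hours.
print(worst_case_data_loss(log_interval))
print(logs_per_restore_point(timedelta(hours=12), log_interval))

# Without differentials, only the daily full backup is a restore point.
print(logs_per_restore_point(timedelta(hours=24), log_interval))
```

Halving the differential interval halves the upper bound on log restores, which is why more frequent differentials shorten the RTO.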

Cleanup

An example of disaster recovery.

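| # | Full Backup | Differentials | Transaction logs | Disaster |
|---|---|---|---|---|
|  | 12:00 AM |  |  |  |
| 1 |  |  | 12:15 AM |  |
| 2 |  |  | 12:30 AM |  |
| 3 |  |  | 12:45 AM |  |
| 4 |  |  | 1:00 AM |  |
| 5 |  |  | 1:15 AM |  |
| 6 |  |  | 1:30 AM |  |
| 7 |  |  | 1:45 AM |  |
| 8 |  |  | 2:00 AM |  |
| 9 |  |  | 2:15 AM |  |
| 10 |  |  | 2:30 AM |  |
| 11 |  |  | 2:45 AM |  |
| 12 |  |  | 3:00 AM |  |
| 13 |  |  | 3:15 AM |  |
| 14 |  |  | 3:30 AM |  |
| 15 |  |  | 3:45 AM |  |
| 16 |  |  | 4:00 AM |  |
| 17 |  |  | 4:15 AM |  |
| 18 |  |  | 4:30 AM |  |
| 19 |  |  | 4:45 AM |  |
| 20 |  |  | 5:00 AM |  |
| 21 |  |  | 5:15 AM |  |
| 22 |  |  | 5:30 AM |  |
| 23 |  |  | 5:45 AM |  |
| 24 |  |  | 6:00 AM |  |
| 25 |  |  | 6:15 AM |  |
| 26 |  |  | 6:30 AM |  |
| 27 |  |  | 6:45 AM |  |
| 28 |  |  | 7:00 AM |  |
| 29 |  |  | 7:15 AM |  |
|  |  |  |  | 7:25 AM (Disaster 1) |
| 30 |  |  | 7:30 AM |  |
| 31 |  |  | 7:45 AM |  |
| 32 |  |  | 8:00 AM |  |
| 33 |  |  | 8:15 AM |  |
| 34 |  |  | 8:30 AM |  |
| 35 |  |  | 8:45 AM |  |
| 36 |  |  | 9:00 AM |  |
| 37 |  |  | 9:15 AM |  |
| 38 |  |  | 9:30 AM |  |
| 39 |  |  | 9:45 AM |  |
| 40 |  |  | 10:00 AM |  |
| 41 |  |  | 10:15 AM |  |
| 42 |  |  | 10:30 AM |  |
| 43 |  |  | 10:45 AM |  |
| 44 |  |  | 11:00 AM |  |
| 45 |  |  | 11:15 AM |  |
| 46 |  |  | 11:30 AM |  |
| 47 |  |  | 11:45 AM |  |
| 48 |  | 12:00 PM | 12:00 PM |  |
| 49 |  |  | 12:15 PM |  |
| 50 |  |  | 12:30 PM |  |
| 51 |  |  | 12:45 PM |  |
| 52 |  |  | 1:00 PM |  |
|  |  |  |  | 1:05 PM (Disaster 2) |
| 53 |  |  | 1:15 PM |  |
| 54 |  |  | 1:30 PM |  |
| 55 |  |  | 1:45 PM |  |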

With Disaster #1 (at 7:25 AM), the recovery procedure would be:

  1. Restore the full backup taken at 12:00 AM.

  2. Restore each transaction log in ascending order, from log #1 to log #29 (29 distinct restore commands).

Ten minutes of data, from 7:15 AM to 7:25 AM, would be lost.
With Disaster #2 (at 1:05 PM), the recovery procedure would be:

  1. Restore the full backup taken at 12:00 AM.

  2. Restore the differential backup taken at 12:00 PM.

  3. Restore each transaction log in ascending order, from log #49 to log #52 (4 distinct restore commands).

Five minutes of data, from 1:00 PM to 1:05 PM, would be lost.
If the differential backup had not been taken, the full backup would have had to be followed by restoring all 52 transaction log backups.
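
The two scenarios can be reproduced with a short sketch that picks the restore chain for a given disaster time. It assumes the schedule in the table above; the function name `plan_restore` and the calendar date are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def plan_restore(full, differentials, logs, disaster):
    """Return (differential to restore or None, log numbers, data lost).

    `logs` maps each log number to the time that backup was taken.
    """
    # Use the most recent differential taken before the disaster, if any.
    usable = [d for d in differentials if full <= d <= disaster]
    diff = max(usable, default=None)
    start = diff if diff is not None else full
    # Replay, in order, every log taken after that point and before the disaster.
    to_restore = [n for n, t in sorted(logs.items()) if start < t <= disaster]
    last_point = max([start] + [logs[n] for n in to_restore])
    return diff, to_restore, disaster - last_point


full = datetime(2024, 1, 1, 0, 0)                      # 12:00 AM full backup
diff = full + timedelta(hours=12)                      # 12:00 PM differential
logs = {n: full + timedelta(minutes=15 * n) for n in range(1, 56)}

disaster_1 = datetime(2024, 1, 1, 7, 25)
disaster_2 = datetime(2024, 1, 1, 13, 5)

print(plan_restore(full, [diff], logs, disaster_1))    # full + logs 1 to 29
print(plan_restore(full, [diff], logs, disaster_2))    # full + diff + logs 49 to 52
print(len(plan_restore(full, [], logs, disaster_2)[1]))  # no differential
```

Log #48 shares its timestamp with the differential, so the sketch treats it as already covered by the differential, matching the procedure described for Disaster #2.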
Notes:

  • In the scenarios above the time that it takes to process the recovery of each transaction log is the main factor in the RTO calculation.

  • The full backup, differential backup, and transaction logs have to be from the same "set". A previous full backup would not work.

  • Regarding data loss and MES data specifically, most transactions that result from process historian data can be "re-triggered" and reprocessed very easily up to the point in time of the disaster. The transactions at risk of being lost are those entered manually by operators or downloaded from third-party applications via interfaces.

SQL Server Agent Database Backup

SQL Server Agent jobs can be set up to run maintenance tasks that perform database and transaction log backups.
For example, you might schedule three jobs: a Full backup once per day, a Differential backup every 12 hours, and a Transaction Log backup every 15 minutes or half-hour. The database and transaction log backups would then be sent to a storage device separate from the database server, such as a NAS server.
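
As a rough illustration, the T-SQL that each of the three jobs would run can be generated as shown below. The database name `MES_DB`, the NAS share path, and the helper `backup_statement` are assumptions made for this sketch; only the `BACKUP DATABASE`, `WITH DIFFERENTIAL`, and `BACKUP LOG` statements are standard T-SQL.

```python
def backup_statement(database: str, kind: str, target: str) -> str:
    """Build the T-SQL backup command for one job type."""
    if kind == "full":
        return f"BACKUP DATABASE [{database}] TO DISK = N'{target}';"
    if kind == "differential":
        return (f"BACKUP DATABASE [{database}] TO DISK = N'{target}' "
                "WITH DIFFERENTIAL;")
    if kind == "log":
        return f"BACKUP LOG [{database}] TO DISK = N'{target}';"
    raise ValueError(f"unknown backup kind: {kind}")


# Hypothetical database name and NAS share; adjust to the site.
DATABASE = "MES_DB"
SHARE = r"\\nas01\sqlbackups"

# (job type, schedule, file extension), following the example schedule.
JOBS = [
    ("full", "once per day", "bak"),
    ("differential", "every 12 hours", "dif"),
    ("log", "every 15 minutes", "trn"),
]

for kind, schedule, ext in JOBS:
    target = rf"{SHARE}\{DATABASE}_{kind}.{ext}"
    print(f"-- {schedule}")
    print(backup_statement(DATABASE, kind, target))
```

In practice each job would write to a uniquely named file (for example, with a timestamp in the name) so that earlier backups in the retention window are not overwritten.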

Historian Archives and OPC Project Configuration Files

| Server | Backup Type | Schedule | Retention |
|---|---|---|---|
| Historian (Off Site Backup) | Full (all archives) | Once | Permanent |
| Historian (Off Site Backup) | Latest 3 archives | Daily | Keep one copy of each; purge duplicates |
| OPC Servers (Off Site Backup) | Last "project" file (.opf) | Weekly or on ad-hoc change | Keep last two current ones |