Server Monitoring


Documentation

The "as built" MES environment should be documented and available to Support personnel.
It should include topics such as:


  • Server name
  • IP addresses
  • Domain
  • Purpose
  • Main applications and versions installed
  • Backup strategy
  • Notes
  • Operating system
  • Inventory of drives and sizes
  • Free space alerts (email)
  • Memory (RAM)
  • CPUs/Cores
  • For Support/Troubleshooting:
    • VPN access information
    • Contact information for key personnel
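Several of the inventory items above (server name, IP addresses, operating system, CPU count, drive sizes) can be gathered by a script rather than typed by hand. The following is a minimal sketch using only the Python standard library; the function name and the dictionary keys are illustrative choices, not part of any MES product, and items such as purpose, backup strategy and contacts still have to be entered manually.

```python
import os
import platform
import shutil
import socket


def collect_inventory(drive_paths):
    """Gather basic 'as built' facts about the local server."""
    hostname = socket.gethostname()
    try:
        # Third element of the result is the list of IPv4 addresses.
        ip_addresses = sorted(set(socket.gethostbyname_ex(hostname)[2]))
    except OSError:
        ip_addresses = []  # name resolution can fail on isolated hosts

    drives = {}
    for path in drive_paths:
        usage = shutil.disk_usage(path)
        drives[path] = {
            "total_gb": round(usage.total / 1024**3, 1),
            "free_gb": round(usage.free / 1024**3, 1),
        }

    return {
        "server_name": hostname,
        "ip_addresses": ip_addresses,
        "operating_system": f"{platform.system()} {platform.release()}",
        "logical_cpus": os.cpu_count(),
        "drives": drives,
    }
```

For example, `collect_inventory(["C:\\", "D:\\"])` would describe two drives, assuming those drive letters exist on the server being documented.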

Server monitoring

Server monitoring determines whether each server is appropriately sized for the load it is running. This document makes no assumptions about which tools the I.T. department has available for this monitoring.
All servers in the MES environment should be monitored, especially the Application, Process Historian and SQL servers. The OPC servers should be monitored for a short period after tags are installed and configured to determine whether they are sized correctly.

Task Manager

Monitoring can be performed manually using Task Manager. Task Manager provides information about the applications currently running on the system, the processes behind them and their memory usage, and statistics on memory and processor performance. Although useful as a quick reference to system operation and performance, Task Manager has no logging or alerting capabilities, so it should be used only as a real-time view of how the server is operating "now".
The first time you open Task Manager on Windows Server, you are presented with a minimal display.


Clicking "More details" expands Task Manager to the full view.


In the expanded view, you can see how much CPU, memory, disk and network each process is using and what load it adds to the server. This helps when troubleshooting slow response issues, because you can quickly see which process is using the most disk, memory or network bandwidth.


If you expand a process, you see more detail, such as the CPU, memory, disk and network used by each item within it. For example, expanding "Service Host: Network Service" shows the services that are running inside it.


If you want to stop a service, you can do it by right-clicking on it.


The "Performance" tab makes it easy to get all the required information about server performance in one place.
CPU Usage: Shows the type of CPU, the clock speed, the total sockets, the number of cores, the number of logical processors exposed to the OS, and whether the CPU supports virtualization.
This tab also shows the CPU usage since Task Manager was opened.
Memory: This tab displays information about:

  • In use – Memory used by processes, drivers and the OS.

  • Modified – Memory whose contents must be written to disk before it can be used for another purpose.

  • Standby – Memory that contains cached data that is not actively in use.

  • Free – Memory that is not currently in use.


Disk: The graphs show information about the disk drives connected to the server.
The first graph shows the % disk activity over the last 60 seconds.
The second graph shows the speed at which data is read from or written to the disk, in KB or MB per second.
Network: The graph shows the network throughput over the last 60 seconds. It also shows the current upload and download speeds, the type of connection, and the IPv4 and IPv6 addresses.

Event Viewer


Event Viewer provides historical information that can help you troubleshoot and track down system and security issues. The following Windows Logs are available, along with a separate Applications and Services Logs category:

  • Application

  • Security

  • Setup

  • System

  • Forwarded Events

  • Applications and Services Logs


Application Log: Records events logged by applications. For example, a SQL Database might record a database connection error.


Security Log: Records events such as valid and invalid logon attempts and the creation or deletion of files or other objects. It records the events that you have selected for auditing through local or global group policies (GPOs).

Setup Log: Records events related to application setup.


System Log: Records events logged by Windows system components; the event types are predetermined by Windows. Examples are the failure of a driver or of another system component to load at startup.


Forwarded Events Log: Stores events collected from remote computers. To collect events from remote computers, you must create an event subscription. To learn more, see Event Subscriptions on the Microsoft TechNet site.


Applications and Services Logs: Record events from a single application or component rather than events that might have a system-wide impact.
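The logs described above can also be read from a script, which is convenient when the same check has to be repeated on several servers. The sketch below assumes the standard Windows `wevtutil` utility is on the path and uses its `qe` (query events) command; the function names are illustrative, and reading the Security log normally requires an elevated prompt.

```python
import subprocess


def build_event_query(log_name, max_events=10, level=None):
    """Build a wevtutil command that reads the newest events from one log."""
    command = [
        "wevtutil", "qe", log_name,
        f"/c:{max_events}",  # maximum number of events to return
        "/rd:true",          # reverse direction: newest events first
        "/f:text",           # plain-text output instead of XML
    ]
    if level is not None:
        # XPath filter on the event level (1=Critical, 2=Error, 3=Warning).
        command.append(f"/q:*[System[(Level={level})]]")
    return command


def read_events(log_name, max_events=10, level=None):
    """Run the query on a Windows server and return the text output."""
    result = subprocess.run(
        build_event_query(log_name, max_events, level),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a Windows server, `read_events("System", max_events=5, level=2)` would return the five most recent Error events from the System log as text.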


The following table lists the common event properties:

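Property Name

Description

Source

The software that logged the event, such as an application, a service, or a component of the system.

Event ID

A number that identifies the particular type of event.

Level

The severity of the event: Information, Warning, Error or Critical. In the Security log, events are classified as Audit Success or Audit Failure.

User

The name of the user on whose behalf the event occurred.

OpCode

A value that identifies the activity, or the point within an activity, that the application was performing when it raised the event.

Log

The name of the log in which the event was recorded.

Task Category

The subcomponent or activity of the event publisher that the event relates to.

Keywords

A set of categories or tags that can be used to filter or search for events.

Computer

The name of the computer on which the event occurred.

Date and Time

The date and time at which the event was logged.
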

Hard Drive Space (Free Space)

If a server runs out of disk space, its performance suffers and its services can fail. All data files (e.g., SQL Server databases) and log files should be configured to reside on the largest drive on the server.
It is important to regularly monitor the available free disk space on:

  • Process Historian: the drive(s) where the active archives reside and the drive(s) where the daily backups reside.

  • SQL: the drives where the .mdf and .ldf files reside and the drives where the tempdb files reside.

  • Application Server: the drive where the log files reside.

How to Troubleshoot Disk Space Usage
Check the following:

  • Check the size of the Proficy log files.

  • Check the size of the SQL Server database files.

  • Check the size of the SQL Server log files.

  • Check the memory allocated to SQL Server. If SQL Server has exceeded its dynamic memory allocation, it may generate large swap files.
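Where no dedicated monitoring tool is available, the free-space check on the drives listed above can be scripted. This is a minimal sketch using Python's standard `shutil.disk_usage`; the 15% threshold and the function name are assumptions for illustration, and the appropriate threshold depends on the size and growth rate of each drive.

```python
import shutil


def free_space_report(paths, min_free_percent=15.0):
    """Return (path, free_percent, is_low) for each monitored drive or folder."""
    report = []
    for path in paths:
        usage = shutil.disk_usage(path)
        free_percent = 100.0 * usage.free / usage.total
        is_low = free_percent < min_free_percent
        report.append((path, round(free_percent, 1), is_low))
    return report


if __name__ == "__main__":
    # "." is a placeholder; list the archive, backup, database and log drives.
    for path, free_percent, is_low in free_space_report(["."]):
        status = "LOW" if is_low else "ok"
        print(f"{path}: {free_percent}% free ({status})")
```

Run from Windows Task Scheduler, a script like this could feed an email or alarm step whenever any entry is flagged as low.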

The following command checks for and purges Historian log files that are older than 30 days in the folder "C:\Proficy Historian Data\LogFiles". You can save it in a batch file and run it weekly or monthly using Windows Task Scheduler. PowerShell options are also available.
:: Delete log files older than 30 days (searches subfolders as well)
:: To preview the files without deleting them, replace "del" with "echo"
forfiles /p "C:\Proficy Historian Data\LogFiles" /s /m *.log /d -30 /c "cmd /c del @path"
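For sites that prefer a script with a dry-run mode, the same purge can be sketched in Python. This is an approximation of the batch command above rather than an exact equivalent: it compares each file's last-modified time with a cutoff measured from the current time, and the function names are illustrative.

```python
import time
from pathlib import Path


def find_old_logs(folder, max_age_days=30, pattern="*.log"):
    """List files under folder (recursively) last modified before the cutoff."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        path for path in Path(folder).rglob(pattern)
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def purge_old_logs(folder, max_age_days=30, dry_run=True):
    """Delete old log files; with dry_run=True only report what would go."""
    old_files = find_old_logs(folder, max_age_days)
    if not dry_run:
        for path in old_files:
            path.unlink()
    return old_files
```

Calling `purge_old_logs(r"C:\Proficy Historian Data\LogFiles")` first with the default `dry_run=True` shows which files would be removed before anything is deleted.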
I.T. teams have various tools available for checking and alarming on minimum hard drive free-space thresholds. A good preventative maintenance plan includes one of these tools monitoring and alarming on the key servers. We have experienced several production outages over the years caused by SQL, Application, Historian or OPC servers failing after they ran out of hard drive space.

Create and Confirm snapshots are being performed

  1. Configure weekly, nightly, or hourly snapshot schedules for the servers:

    • Applications Server: Daily

    • SQL Server: Daily

    • Process Historian: Daily

    • Visualization Server: Daily or Weekly

  2. Specify the number of snapshot copies to be retained and for how long.

  3. Generate Syslog messages for server actions and have automatic alarms sent to the appropriate persons on snapshot failures.
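Confirming that snapshots are actually being taken amounts to comparing the time of the most recent successful snapshot with the schedule. The sketch below shows only that comparison; how the last snapshot times are obtained depends on the virtualization or backup platform and is left as an assumed input, and the two-hour grace period and all names are illustrative.

```python
from datetime import datetime, timedelta

# Maximum allowed age of the newest snapshot, taken from the schedule above.
# The weekly option is assumed here for the Visualization Server.
SNAPSHOT_MAX_AGE = {
    "Applications Server": timedelta(days=1),
    "SQL Server": timedelta(days=1),
    "Process Historian": timedelta(days=1),
    "Visualization Server": timedelta(days=7),
}


def overdue_snapshots(last_snapshot_times, now, grace=timedelta(hours=2)):
    """Return the servers whose newest snapshot is missing or too old."""
    overdue = []
    for server, max_age in SNAPSHOT_MAX_AGE.items():
        last = last_snapshot_times.get(server)
        if last is None or now - last > max_age + grace:
            overdue.append(server)
    return overdue
```

Any server returned by `overdue_snapshots` would then be a candidate for the automatic alarm described in the last step.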

Monitor Server uptime

There is some debate about whether servers need to be rebooted periodically. Rebooting is especially relevant for SQL servers, which can experience port exhaustion.
Port exhaustion can cause many kinds of problems for your servers. Symptoms include:

  • Users cannot connect to file shares on a remote server.

  • DNS name registration might fail.

  • Authentication might fail.

  • Trust operations might fail between domain controllers.

  • Replication might fail between domain controllers.

  • MMC consoles do not work or cannot connect to remote servers.

It is therefore a good idea to reboot servers at least once or twice a year. This may happen naturally when updates or hotfixes are installed. To determine when a server was last rebooted, and therefore its uptime, run the following command:

  • Go to "Start" -> "Run".

  • Type "CMD" and press the "Enter" key.

  • Type the command "net statistics server" and press the "Enter" key.

  • The line that starts with "Statistics since …" shows the date and time since which the server has been up.
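The manual steps above can be wrapped in a small script so that uptime is checked the same way on every server. This sketch assumes an English-language Windows installation, since it looks for the literal text "Statistics since"; it returns the date text as printed by the command rather than parsing it, because the date format depends on the regional settings. The function names are illustrative.

```python
import subprocess


def extract_statistics_since(net_output):
    """Return the text after 'Statistics since' in the command output."""
    prefix = "statistics since"
    for line in net_output.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(prefix):
            return stripped[len(prefix):].strip()
    return None  # line not found, e.g. on a non-English installation


def server_up_since():
    """Run 'net statistics server' on Windows and extract the start time."""
    result = subprocess.run(
        ["net", "statistics", "server"],
        capture_output=True, text=True, check=True,
    )
    return extract_statistics_since(result.stdout)
```

Keeping the parsing in its own function makes it possible to check the logic against saved command output without running the command.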

Historian System Statistics

GE Hist 55_Using_Historian_Administrator
The System Statistics screen, as shown in the following figure, displays current system status and performance statistics. It presents an overall view of system health. The screen has three sections:

  • System Statistics Section

  • Collectors Panel

  • Alerts Panel

System Statistics

NOTE: The statistics displayed on this screen are computed independently on various time scales and schedules. As a result, they may update at different times.

The Field

Display

Receive Rate (a time-based chart in events/minute)

How busy the server is at a given instant – the rate at which the server is receiving data from all collectors.

Archive Compression (% compression)

The current effect of archive data compression. If the value is zero, it indicates that archive compression is either ineffective or turned off. To increase the effect of data compression, increase the value of archive compression deadbands on individual tags in the Tag Maintenance screen to activate compression. In computing the effect of archive compression, Historian counts internal system tags as well as data source tags. Therefore, when working with a very small number of tags and with compression disabled on data source tags, this field may indicate a value other than zero. If you use a realistic number of tags, however, system tags will constitute a very small percentage of total tags and will therefore not cause a significant error in computing the effect of archive compression on the total system.

Write Cache Hit

The hit ratio of the write cache in percent of total writes. It is a measure of how efficiently the system is collecting data and should typically range from 95 to 99.99%. If the data is changing rapidly over a wide range, however, the hit percentage drops significantly because current values differ from recently cached values. More regular sampling may increase the hit percentage. Out of order data also reduces the hit ratio.

Failed Writes

The number of samples that failed to be written. Since failed writes are a measure of system malfunctions or an indication of offline archive problems, the value shown in the display should be zero. If you observe a non-zero value, investigate the cause of the problem and take corrective action. The Historian also generates a message if a write fails. Note that the message only appears once per tag, for a succession of failed writes associated with that tag. For example, if the number displayed in this field is 20, but they all pertain to one Historian tag, you will only receive one message until the Historian tag is healthy again.

Messages Since Startup

A count of system messages generated since the last startup. The system resets the value to zero on restart. The message database, however, may contain more messages than this number indicates.

 

 


Alerts Since Startup

A count of system warnings or alerts generated since the last startup. A high value here may indicate a problem of some kind. You should review the alerts and determine the probable cause. The count resets to zero on restart. The message database, however, may contain more alerts than this number indicates.

Calculations

Appears as Enabled if the Calculation Collector feature is licensed on the software key.

Server-to-Server

Appears as Enabled if the Server-to-Server Collector feature is licensed on the software key.

Alarms since Startup

A count of alarms received by the Historian Data Archive since starting up.

Server Memory

How much server memory the Historian Data Archive is consuming.

Free Space (MB)

How much disk space (in MB) is left in the current archive.

Consumption Rate (MB/day)

How fast you are using up archive disk space. If the value is too high, you can reduce it by slowing the poll rate on selected tags or data points or by increasing the filtering on the data (widening the compression deadband to increase compression).

Est. Days to Full (Days)

How much time is left before the archive is full, based on the current consumption rate. At that point, a new archive must be opened (could be automatic). To increase the days to full, you must reduce the Consumption Rate as noted above. To ensure that collection is not interrupted, you should make sure that the Automatically Create Archives option is enabled in the Data Store Maintenance screen (Global Options Tab). You may also want to enable Overwrite Old Archives if you have limited disk capacity. Enabling overwrite, however, means that some old data will be lost when new data overwrites the data in the oldest online archive. Use this feature only when necessary. The Estimated Days Until Full field is dynamically calculated by the server and becomes more accurate as an archive gets closer to completion. This number is only an estimate and will vary based on a number of factors, including the current compression effectiveness. The System sends messages notifying you at 5, 3, and 1 days until full.

 

 

 


Active Tags

Number of tags in your configuration.

Licensed Tags

How many tags are authorized for this Historian installation by the Software Key and License. If this field displays 100 tags and the Licensed Users field displays 1 client, you are likely running in demonstration mode and may have incorrectly installed your hardware key.

Active Users

The number of users currently accessing the Historian system.

Licensed Users

The number of users authorized to access the Historian application by the Software Key and License. The number of users authorized to access Historian is strictly based on the Software Key and License. However, if you have used all of your available Client Access Licenses (CALs) and need an additional one to administer the system in an emergency, you have the option to reserve a CAL. The reserved CAL allows you to access the server. To use it, provide the reserved CAL to the system administrators and add them to the iH Security Admins group. A system administrator will then be able to connect to Historian in an emergency. This facility is optional and does not guarantee a connection. It only covers emergencies in which a CAL is preventing you from accessing the system and may not work under other conditions. For example, if Historian is busy, you will not be able to connect using this feature. If this field displays 1 client and the Licensed Tags field displays 100 tags, you are likely running in demonstration mode and may have incorrectly installed your hardware key. Refer to Installing the Hardware Key for more information.

Alarm Rate

Displays the rate at which Historian is receiving alarm and event data.

SCADA Tags

Displays the number of Proficy Cimplicity or Proficy iFIX tags.

Tags Consumed by Arrays

Indicates the total number of Array tags consumed by Proficy Historian.
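The relationship between the Free Space, Consumption Rate and Est. Days to Full fields described above can be illustrated with simple arithmetic. The sketch below is an assumption about the general form of the estimate (free space divided by consumption rate), not the server's actual calculation, which the table notes is dynamic and affected by compression; the function name is illustrative.

```python
def estimated_days_to_full(free_space_mb, consumption_rate_mb_per_day):
    """Rough estimate: remaining archive space divided by the daily usage."""
    if consumption_rate_mb_per_day <= 0:
        return None  # no measurable growth, so no meaningful estimate
    return free_space_mb / consumption_rate_mb_per_day


if __name__ == "__main__":
    # Illustrative figures only: 1200 MB free, consumed at 150 MB/day.
    print(estimated_days_to_full(1200, 150))
    # Halving the consumption rate doubles the estimate.
    print(estimated_days_to_full(1200, 75))
```

The second call shows why slowing poll rates or widening compression deadbands, as described under Consumption Rate, extends the time before a new archive is needed.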

Collectors panel Statistics section

The Collectors panel shows current statistics on the operation of all connected data collectors in the system. For more information on a particular data collector, click the name of the Data Collector you want to examine. The Collector Maintenance Screen for that collector then appears. You can also display the Collector Maintenance screen by clicking on the Collector link in the top line of the System Statistics screen.
To refresh the Collectors panel statistics automatically, select the Auto option in the Collectors panel. With Auto selected, the statistics refresh every 45 seconds.
To refresh the statistics manually, click the Refresh button on the Collectors panel.
The Collectors panel of the System Statistics screen displays data described in the following table.

The Field

Display

Collector

The collector ID, which is used to identify the collector in a Historian system.

Status


The current status of the collection. "Running" indicates that the collector is operating. "Stopped" indicates that it is in pause mode and not collecting data. "Unknown" indicates that status information about the collector is unavailable at present, perhaps as a result of a lost connection between collector and server.

Computer

The name of the computer the collector is running on.

Report Rate

The current rate, in samples/minute, at which the server is receiving data from the collector. It is a measure of the collection rate and also of data compression activity. A value equal to the data acquisition rate, when Collector Compression Percent is zero, indicates that every data value received from the data source is being reported to the server. This means that the collector is not performing any data compression. You can lower the report rate, and make the system more efficient, by increasing the data compression at the collector. To do this, widen the collector compression deadbands for selected tags.


Overruns

The overruns in relation to the total events collected since startup. This value is calculated by using the following equation: OVERRUN_PCT = OVERRUNS / ( OVERRUNS + TOTAL_EVENTS_COLLECTED ). Overruns are a count of the total number of data events not collected on their scheduled polling cycle. In normal operation, this value should be zero. You may be able to reduce the number of overruns on the collector by increasing the tag collection intervals (per tag).

Compression %

How effective compression currently is for the specific collector, as a percentage, since collector startup. A value of zero indicates that compression is either turned off or not effective. To increase the value, enable compression on the collector's associated tags and increase the width of the compression deadband on selected tags. The collector keeps track of how many samples it collected from the data source (an OPC Server, for example) and how many samples it reported to the Historian data archiver (after collector compression is complete). A low number or zero means that almost everything coming from the data source is being sent to the Historian data archiver, either because too many samples are exceeding the compression deadband or because you are not using collector compression. A high number or 100 means that you are collecting a lot of samples, but they are not exceeding the collector compression deadband and therefore are not being sent to the server.

Out of Order

The number of samples, within a series of timestamped data values normally transmitted in sequence, that have been received out of sequence since collector startup. This field applies to all collectors. Even though the events are still stored, a steadily increasing number of out-of-order events indicates a problem with data transmission that you should investigate. For instance, a steadily increasing number of out-of-order events when you are using the OPC Collector means that data is arriving out of order between the OPC Server and the OPC Collector. This may also cause out-of-order data between the OPC Collector and the data archiver, but that is not what this statistic indicates.
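The Overruns and Compression % fields in the table above can be reproduced by hand as a sanity check. The first function below follows the OVERRUN_PCT equation exactly as given in the table, which yields a fraction (multiply by 100 to read it as a percentage). The second is an assumption: the table describes compression only in terms of samples collected versus samples reported, so the formula shown is one plausible reading rather than the documented calculation. Both function names are illustrative.

```python
def overrun_pct(overruns, total_events_collected):
    """OVERRUN_PCT = OVERRUNS / (OVERRUNS + TOTAL_EVENTS_COLLECTED)."""
    attempted = overruns + total_events_collected
    if attempted == 0:
        return 0.0  # nothing scheduled yet, so nothing was missed
    return overruns / attempted


def collector_compression_percent(samples_collected, samples_reported):
    """Assumed formula: share of collected samples not sent to the archiver."""
    if samples_collected == 0:
        return 0.0
    return 100.0 * (samples_collected - samples_reported) / samples_collected
```

With 5 overruns against 95 collected events, `overrun_pct(5, 95)` gives 0.05, and a collector that read 1000 samples but reported 250 would show 75.0 under the assumed compression formula.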

Alerts Panel

The Alerts panel displays all alerts and warnings received or generated by the system. You can scan through these messages using the scroll bar at the right of the window. The window displays the system timestamp and the record of each message.
To stop automatic updating of the display in the Historian Non-Web Administrator, clear the Show Alerts check box. This setting is reset when you restart the Non-Web Administrator.
To refresh the Alerts panel statistics automatically, select the Auto option in the Alerts panel. With Auto selected, the panel refreshes the last five seconds of alerts panel statistics every 25 seconds.
To refresh the statistics manually, click the Refresh button on the Alerts panel.
The Alerts panel of the System Statistics screen displays data described in the following table:

The Field

Display

Timestamp

The timestamp associated with the message or alert.

Topic

The type of alert message. Only the Services and Performance alerts appear here. A total of up to 250 of the most recent messages will be displayed.

Message

The content of the message or alert.





Alerts/Messages

The Message Search screen, shown in the following figure, lets you enter search parameters, such as start and end times, and limit the search to alerts only or messages only. You can further refine the search by topic and by a text mask.

  1. Enter a start time and end time (required). If your start date and end date are identical, you must enter a timestamp with the date.

  2. Select All, Alerts, or Messages.

  3. Select a Topic (optional).

  4. Enter a text mask (optional). If you do not specify a text mask, all items for the associated alert or message will be returned. Use a text substring for a mask. The Message contains field does not accept wildcards.

  5. Click Search. The results of the search are displayed in the scrolling window at the right of the screen.
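The search criteria above combine as successive filters: a required time range, an optional type, an optional topic, and an optional plain-text substring. The sketch below models that logic on an in-memory list to make the behavior concrete. The `HistorianMessage` record and the function are hypothetical and are not part of the Historian API; case-sensitive matching is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class HistorianMessage:
    """Hypothetical record standing in for one alert or message."""
    timestamp: datetime
    kind: str   # "Alert" or "Message"
    topic: str
    text: str


def search_messages(messages, start, end, show="All", topic=None, text_mask=None):
    """Apply the Message Search criteria to a list of records."""
    wanted_kind = {"Alerts": "Alert", "Messages": "Message"}.get(show)
    results = []
    for msg in messages:
        if not (start <= msg.timestamp <= end):
            continue  # start and end times are required
        if wanted_kind is not None and msg.kind != wanted_kind:
            continue  # "All" leaves wanted_kind as None
        if topic is not None and msg.topic != topic:
            continue
        # Plain substring match: wildcard characters are not interpreted.
        if text_mask and text_mask not in msg.text:
            continue
        results.append(msg)
    return results
```

Because the mask is a plain substring, a search for `*failed*` in this model would match only text that literally contains the asterisks.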

AutomaTech Inc.