Server Monitoring

Documentation

The "as built" MES environment should be documented and available to Support personnel.
It should include topics such as:

Server name
IP addresses
Domain
Purpose
Main applications and versions installed
Backup strategy
Notes
Operating system
Inventory of drives and sizes
Free space alerts (email)
Memory (RAM)
CPUs/Cores
For Support/Troubleshooting:
VPN access information
Contact information for key personnel
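
Part of this inventory can be collected automatically. The following is a minimal sketch, assuming Python is available on the server; the function names and the output keys are illustrative and not part of any MES product. It only gathers what the standard library can see (server name, IP addresses, operating system, logical processors, drive sizes and free space). Installed RAM, application versions and contact details would still need to be recorded by hand or with another tool.

```python
import json
import os
import platform
import shutil
import socket
import string


def list_drives():
    """Return the root paths of the drives present on this machine."""
    if platform.system() == "Windows":
        # Probe A:\ .. Z:\ and keep the letters that actually exist.
        return [f"{letter}:\\" for letter in string.ascii_uppercase
                if os.path.exists(f"{letter}:\\")]
    return ["/"]  # fallback so the sketch also runs on non-Windows hosts


def collect_inventory():
    """Gather the basic 'as built' facts that the standard library exposes."""
    hostname = socket.gethostname()
    try:
        ip_addresses = sorted({info[4][0]
                               for info in socket.getaddrinfo(hostname, None)})
    except socket.gaierror:
        ip_addresses = []  # name did not resolve; record nothing rather than guess
    drives = {}
    for root in list_drives():
        try:
            usage = shutil.disk_usage(root)
        except OSError:
            continue  # e.g. an empty optical drive
        drives[root] = {"total_gb": round(usage.total / 1024**3, 1),
                        "free_gb": round(usage.free / 1024**3, 1)}
    return {
        "server_name": hostname,
        "ip_addresses": ip_addresses,
        "operating_system": f"{platform.system()} {platform.release()} "
                            f"({platform.version()})",
        "logical_processors": os.cpu_count(),
        "drives": drives,
    }


if __name__ == "__main__":
    print(json.dumps(collect_inventory(), indent=2))
```

Saving the JSON output alongside the manually maintained fields gives Support a dated snapshot of each server that can be compared after upgrades.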

Server monitoring

Server monitoring determines whether each server is appropriately sized for the load it is running. The tools available for this vary from one I.T. department to another, so this document covers only the tools built into Windows Server.
All servers in the MES environment should be monitored, especially the Application, Process Historian and SQL servers. The OPC servers should be monitored for a short period after tags are installed and configured to determine whether they are sized correctly.

Task Manager

Monitoring can be performed manually with Task Manager. Task Manager provides information about the applications currently running on the system, the processes and their memory usage, and statistics about memory and processor performance. Although it is useful as a quick reference to system operation and performance, Task Manager has no logging or alerting capabilities, so it should be used only as a real-time view of how the server is operating "now".
The first time you open Task Manager on Windows Server, you are presented with a minimal display.

Click "More details" to expand it to the full view.

In the more details view, you can see the load that each process is adding to the server (CPU, Memory, Disk and Network). This helps when troubleshooting slow response issues, because you can quickly see which process is using the most CPU, memory, disk or network bandwidth.

If you expand a process, you will see more detail, such as the CPU, memory, disk and network used by each item within it. For example, expanding "Service Host: Network Service" shows the services that are running inside it.

If you want to stop a service, you can do so by right-clicking it.

The "Performance" tab makes it easy to get the key information about server performance in one place.
CPU: Shows the information you need to know about your CPU: the type of CPU, clock speed, total sockets, number of cores, number of logical processors exposed to the OS, and whether the CPU supports virtualization.
This tab also graphs CPU usage over the last 60 seconds.
Memory: This tab displays information about:

In use – Memory used by processes, drivers and the OS.
Modified – Memory whose contents must be written to disk before it can be used for another purpose.
Standby – Memory that contains cached data and code that is not actively in use.
Free – Memory that is not being used.

Disk: The graphs show information about the disk drives connected to the server.
The first graph shows the % disk activity over the last 60 seconds.
The second graph shows the speed at which data is read from or written to the disk, in KB or MB per second.
Network: The graph shows the network throughput over the last 60 seconds. It also shows the current upload and download speeds, the type of connection, and the IPv4 and IPv6 addresses.

Event Viewer

Event Viewer provides historical information that can help with troubleshooting and with tracking down system and security issues. The following logs are available under Windows Logs, together with the separate Applications and Services Logs category:

Application
Security
Setup
System
Forwarded Events
Applications and Services Logs

Application Log: Records events logged by applications. For example, a SQL Database might record a database connection error.

Security Log: Records events such as valid and invalid logon attempts and the creation or deletion of files or other objects. It records the events that you have selected for auditing through local or global group policies (GPOs).

Setup Log: Records events related to application setup.

System Log: Records events logged by Windows system components. For example, the failure of a driver or other system component to load at startup.

Forwarded Events Log: Stores events collected from remote computers. To collect events from remote computers, you must create an event subscription. To learn more, see Event Subscriptions on the Microsoft TechNet site.

Applications and Services Logs: Record events from a single application or component rather than events that might have system-wide impact.

The following table lists the common event properties:

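Property Name	Description
Source	The software that logged the event, such as an application or a system component (for example, a driver).
Event ID	A number identifying the particular event type.
Level	The severity of the event: Information, Warning, Error or Critical in the application and system logs; Audit Success or Audit Failure in the security log.
User	The name of the user on whose behalf the event occurred.
Operational Code (OpCode)	A numeric value that identifies the activity, or the point within an activity, that the application was performing when it raised the event.
Log	The name of the log in which the event was recorded.
Task Category	A subcomponent or activity of the event publisher.
Keywords	A set of categories or tags that can be used to filter or search for events.
Computer	The name of the computer on which the event occurred.
Date and Time	The date and time that the event was logged.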
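
The Level property is often the most useful filter when reviewing a log. As a minimal sketch, the Python below builds a query for the built-in wevtutil command-line tool that returns the newest events at a given level. The function names are hypothetical, and the script only runs the command when it detects Windows. Reading the Security log this way normally requires an elevated prompt.

```python
import platform
import subprocess


def build_wevtutil_query(log_name, level=2, count=10):
    """Build a wevtutil command that returns the newest events at one level.

    Level 1 = Critical, 2 = Error, 3 = Warning, 4 = Information.
    """
    return [
        "wevtutil", "qe", log_name,
        f"/q:*[System[(Level={level})]]",  # XPath filter on the event level
        f"/c:{count}",                     # maximum number of events to return
        "/rd:true",                        # read newest events first
        "/f:text",                         # plain-text output
    ]


def recent_errors(log_name="System", count=10):
    """Run the query on Windows; return None on any other platform."""
    if platform.system() != "Windows":
        return None
    result = subprocess.run(build_wevtutil_query(log_name, count=count),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

For example, recent_errors("Application", count=5) would return the five most recent Error events from the Application log as text.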

Hard Drive Space (Free Space)

If your server runs out of disk space, performance will suffer and services can fail. All data files (e.g., SQL Server databases) and log files should be configured to reside on the largest drive on the server.
It is important to regularly monitor the available free disk space on:
Process Historian: The drive(s) where the active archives reside.
Process Historian: The drive(s) where the daily backups reside.
SQL: The drives where the .mdf and .ldf files reside.
SQL: The drives where the tempdb files reside.
Application Server: The drive where the log files reside.
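
Where no dedicated monitoring tool is in place, a scheduled script can perform a basic version of this check. The sketch below is one possible approach, not a replacement for proper alerting: the function name and the 15% default threshold are assumptions chosen for illustration.

```python
import shutil


def check_free_space(paths, min_free_percent=15.0):
    """Return (path, free_percent) for every path below the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            continue  # nothing meaningful to report for a zero-size volume
        free_percent = usage.free / usage.total * 100
        if free_percent < min_free_percent:
            low.append((path, round(free_percent, 1)))
    return low
```

Calling check_free_space with the drive roots listed above (for example "D:\\" and "E:\\", which are placeholders here) returns the drives that need attention; the result could then be emailed or written to the event log.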

How to Troubleshoot Disk Space Usage

Check the following:

The size of the Proficy log files.
The size of the SQL Server database files.
The size of the SQL Server log files.
The memory allocated to SQL Server. If SQL Server has exceeded its dynamic memory allocation, it may generate large swap files.
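
When it is not obvious which files are consuming the space, listing the largest files under a folder is a quick first step. This is a minimal sketch with a hypothetical function name; it skips files it cannot read rather than stopping.

```python
import os


def largest_files(root, top_n=10):
    """Walk root and return the top_n largest files as (size_bytes, path)."""
    sizes = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or access was denied; skip it
    return sorted(sizes, reverse=True)[:top_n]
```

Pointing it at the folder that holds the database or log files shows at a glance whether a single .ldf or log file accounts for most of the usage.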

Below is a command that can be used to purge Historian log files that are older than 30 days from the path "C:\Proficy Historian Data\LogFiles". You can save it in a batch file and run it weekly or monthly using Windows Task Scheduler. PowerShell options are also available.
:: Delete log files older than 30 days
forfiles /P "C:\Proficy Historian Data\LogFiles" /S /M *.log /D -30 /C "cmd /c del @path"
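
For sites that prefer a script they can test before anything is deleted, the same purge can be sketched in Python. This is an illustrative equivalent, not the vendor's method: the function name and the dry_run parameter are assumptions. Like forfiles, it selects files by last-modified time, but it compares to the second rather than by calendar date.

```python
import os
import time


def purge_old_logs(root, max_age_days=30, suffix=".log", dry_run=True):
    """Delete (or, in dry-run mode, only list) files older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    matched = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(folder, name)
            if os.path.getmtime(path) < cutoff:  # last-modified time
                matched.append(path)
                if not dry_run:
                    os.remove(path)
    return matched
```

Running it first with dry_run=True returns the list of files that would be removed, which can be reviewed before scheduling the job with dry_run=False.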
IT teams have various tools available for checking and alarming on minimum hard drive free space thresholds. A good preventative maintenance plan will include such monitoring and alarming on the key servers. We have experienced several production outages over the years because SQL, Application, Historian or OPC servers failed after running out of hard drive space.

Create snapshots and confirm they are being performed

Configure weekly, nightly, or hourly snapshot schedules for the servers:

Server	Snapshot Frequency
Application Server	Daily
SQL Server	Daily
Process Historian	Daily
Visualization Server	Daily or Weekly

Specify the number of snapshot copies to be retained and the retention duration.
Generate syslog messages for server actions, and have automatic alarms sent to the appropriate persons on snapshot failures.
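
Confirming that snapshots are actually being taken can be reduced to comparing the time of the newest snapshot with the schedule. The sketch below assumes the snapshot times have already been obtained from the backup or virtualization platform (how to do that is platform-specific and not covered here). The role names, the two-hour grace period and the function name are illustrative assumptions, and the Visualization Server is given the looser weekly limit.

```python
from datetime import datetime, timedelta

# Maximum acceptable snapshot age per server role, based on the schedule above.
MAX_AGE = {
    "Application Server": timedelta(days=1),
    "SQL Server": timedelta(days=1),
    "Process Historian": timedelta(days=1),
    "Visualization Server": timedelta(days=7),
}


def stale_snapshots(last_snapshot, now=None, grace=timedelta(hours=2)):
    """Return the roles whose newest snapshot is missing or too old.

    last_snapshot maps a role name to the datetime of its newest snapshot.
    """
    now = now or datetime.now()
    stale = []
    for role, max_age in MAX_AGE.items():
        taken = last_snapshot.get(role)
        if taken is None or now - taken > max_age + grace:
            stale.append(role)
    return stale
```

Any role returned by stale_snapshots is a candidate for the automatic alarm described above.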

Monitor Server uptime

There is some debate about whether servers need to be rebooted now and then. Periodic reboots are especially relevant for SQL servers, which can experience port exhaustion, a condition in which all of the available dynamic (ephemeral) ports are in use.
Port exhaustion can cause all kinds of problems for your servers. Symptoms include:
Users cannot connect to file shares on a remote server.
DNS name registration might fail.
Authentication might fail.
Trust operations might fail between domain controllers.
Replication might fail between domain controllers.
MMC consoles do not work or cannot connect to remote servers.
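
One way to see how close a server is to port exhaustion is to count the dynamic ports in use in the output of "netstat -ano -p tcp". The following is a minimal sketch with a hypothetical function name. It assumes the default dynamic port range of 49152 to 65535 used by Windows Server 2008 and later; the range can be changed by an administrator, and "netsh int ipv4 show dynamicport tcp" shows the range actually in effect.

```python
from collections import Counter

# Default dynamic (ephemeral) port range; an assumption, see the note above.
DYNAMIC_PORT_START, DYNAMIC_PORT_END = 49152, 65535


def summarize_tcp_ports(netstat_output):
    """Count TCP states and dynamic-range local ports in netstat text output."""
    states = Counter()
    dynamic_ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row: TCP  local_addr:port  remote_addr:port  STATE  PID
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        states[fields[3]] += 1
        port = fields[1].rsplit(":", 1)[-1]
        if port.isdigit() and DYNAMIC_PORT_START <= int(port) <= DYNAMIC_PORT_END:
            dynamic_ports.add(int(port))
    return states, len(dynamic_ports)
```

A large and growing number of dynamic ports in use, or many connections stuck in TIME_WAIT, suggests that the server is heading towards exhaustion.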
Suffice it to say that it is a good idea to reboot servers at least once or twice a year. This may happen naturally when updates or hotfixes are installed. To determine when a server was last rebooted, and therefore its uptime, run the following command:

Go to "Start" -> "Run".
Type "CMD" and press the "Enter" key.
Type the command "net statistics server" and press the "Enter" key.
The line that starts with "Statistics since …" shows the time from which the server has been up. Strictly, this is when the Server service started, which normally coincides with the last boot.
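
The manual steps above can also be scripted so that the value can be logged or checked on a schedule. This sketch assumes an English-language Windows installation, because the text "Statistics since" is localized, and it returns the date exactly as Windows prints it rather than parsing it, because the date format depends on regional settings. The function names are hypothetical.

```python
import platform
import subprocess


def statistics_since(net_output):
    """Return the text after 'Statistics since', or None if it is absent."""
    marker = "statistics since"
    for line in net_output.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(marker):
            return stripped[len(marker):].strip()
    return None


def last_boot_text():
    """Run the command on Windows; return None on any other platform."""
    if platform.system() != "Windows":
        return None
    result = subprocess.run(["net", "statistics", "server"],
                            capture_output=True, text=True, check=False)
    return statistics_since(result.stdout)
```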