Server monitoring is essential for keeping server performance at its best and avoiding disruptions to your business. However, server performance monitoring can be dispersed and complex, and keeping an eye on everything has become an uphill battle. Information collected from the server helps you understand what went wrong when a problem occurs. Tools like Retrace make this uphill battle more streamlined and manageable. Let’s learn how to monitor server performance.
What is Server Monitoring?
Server monitoring involves keeping an eye on various metrics to ensure a server’s smooth operation. Monitoring different metrics helps you pinpoint bottlenecks easily.
Behind every business-critical online service, there are typically multiple servers responsible – physical or virtual. A physical server may run multiple engines, resulting in multiple server functions. Some examples of servers are database servers, application servers, and web servers.
Why Server Monitoring is Important
Server monitoring is essential to proactively identify performance issues before they impact the end user. It also helps you understand the server’s system resource usage, which lets you better plan the capacity of the server.
Monitoring the server provides a good indication of the server’s responsiveness and availability – all in the name of ensuring no disruption in the delivery of your service to your customers.
Monitoring metrics can also indicate a cybersecurity threat. This is especially important in web hosting, where exposure to the web increases the web server’s threat profile.
How to Monitor Server Performance
Caption: In web hosting, control panels often include monitoring tools that can help show usage of various resources.
The key to a successful server monitoring strategy is to identify the areas to focus on and create a performance baseline. A baseline lets you properly interpret your server performance for alerting purposes and reap value-added information via reporting.
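One simple way to turn a baseline into an alert rule is to flag values that sit well above what you normally see. The following is a minimal sketch, assuming you already collect samples of a metric during normal operation; the function names, the sample values, and the "mean plus three standard deviations" rule are illustrative assumptions, not a prescribed method.

```python
from statistics import mean, stdev

def build_baseline(samples, k=3.0):
    """Return (mean, upper alert threshold) for samples taken during normal operation."""
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation; needs at least 2 samples
    return mu, mu + k * sigma

def is_anomalous(value, baseline):
    """True when a new reading is above the baseline's alert threshold."""
    _, threshold = baseline
    return value > threshold

# Hypothetical CPU utilization readings (%) from a quiet week.
normal_cpu = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 22.0, 21.0]
baseline = build_baseline(normal_cpu)
```

A reading such as 95% would be flagged against this baseline, while 22% would not. In practice you may prefer separate baselines for peak and off-peak hours.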
There are server monitoring tools that can help you with this. They can also help monitor your applications or the entire infrastructure. Stackify Retrace is an excellent tool for ensuring a successful server monitoring strategy. Stackify’s Retrace APM solution gives you a bird’s eye view of your server’s stack. The Retrace platform automatically analyzes all the applications that contribute to your IT framework, giving you the ability to monitor a wide range of performance-based metrics and take action before small errors and inconsistencies spiral out of control. Retrace gives your team:
- App performance monitoring
- App management functions
- A centralized logging tool
- A line-by-line view of your code and how it fits with the bigger picture
- Robust error tracking reports
- A suite of real-time server monitoring functions
- Individual user monitoring functions
An all-in-one performance monitoring solution, like Retrace, lets you easily dissect your server stack and pinpoint areas of weakness before a larger, catastrophic failure occurs. It gives you a long view of how your server and its constituent apps function under network load.
Key Areas to Monitor
Whether your servers are running on Windows or Unix, these key performance areas serve as a good starting point for any server monitoring strategy. It is important to track these performance metrics as indicators of performance bottlenecks.
Server’s Physical Status
This applies to on-site servers, which need protection from environmental hazards and damage. Aside from keeping the servers in a secure room, you need to monitor the temperature and power supply of the servers.
The temperature cannot exceed the recommended level for efficient performance in your server environment. If the temperature starts to consistently increase, it could signal a fan problem or something else. You’ll need to investigate further.
You also need to monitor the power supply regulators on your server’s power input. They must manage and smooth out power surges and dips. However, should the main supply break, your Uninterruptible Power Supply (UPS) can buy you some time to switch over to the backup power.
Central Processing Unit (CPU) & Memory
Whenever server performance degrades, the usual suspects are CPU utilization and memory resources. If the CPU usage of your server is unusually high or memory utilization is high (less free memory available), your applications’ performance will suffer.
It is good to know which processes consume the most CPU and memory on your server. This is important for fixing resource usage issues quickly. The metrics to measure include CPU Process Count, CPU Thread Count, and CPU % Interrupt Time.
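Overall CPU utilization is usually derived from two readings of cumulative CPU time counters. The sketch below assumes a Linux server and the aggregate `cpu` line of `/proc/stat`; the function names and sample lines are made up for illustration, and the code works on text you pass in rather than reading the file itself.

```python
def cpu_times(stat_line):
    """Split an aggregate 'cpu ...' line from /proc/stat into (idle, total) ticks."""
    fields = [int(x) for x in stat_line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal [guest ...]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    total = sum(fields[:8])  # guest time is already counted inside user/nice
    return idle, total

def cpu_utilization(before, after):
    """Percentage of non-idle CPU time between two readings."""
    idle0, total0 = cpu_times(before)
    idle1, total1 = cpu_times(after)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0
    delta_idle = idle1 - idle0
    return 100.0 * (delta_total - delta_idle) / delta_total

# Hypothetical readings taken a few seconds apart.
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu  1600 0 700 8200 500 0 0 0 0 0"
```

Sampling over a short, fixed interval keeps the figure responsive; a longer interval smooths out brief spikes.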
You’ll also need to monitor the memory usage of your server. This includes the amount of free memory available and the rate at which pages are written to disk to free up physical memory, among others. All these metrics can help you understand the health of your server at all times.
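As an illustration of the memory side, the sketch below computes the share of memory in use from `/proc/meminfo`-style text. It assumes a Linux server that reports `MemTotal` and `MemAvailable`; the helper names and the sample values are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def memory_used_percent(info):
    """Share of memory that is not available to new workloads."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total

# Made-up sample in the same layout as /proc/meminfo.
SAMPLE_MEMINFO = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""
```

`MemAvailable` is used here instead of `MemFree` because it accounts for memory the kernel can reclaim from caches, which tends to give a more realistic picture.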
Uptime
Your website has to be running and available around the clock. The server uptime measures the amount of time a system has been operational. This metric is useful in alerting you when the system may have unknowingly rebooted.
If you discover a discrepancy between the expected server availability period and the server uptime figure, then the system has failed at least once. Confirm that all scheduled tasks expected to run around the time of the failure were completed.
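An unexpected reboot shows up as an uptime value that is lower than the one recorded before it. Here is a minimal sketch of that check, assuming you periodically record the wall-clock time together with the uptime; the function name and sample data are illustrative.

```python
def detect_reboots(samples):
    """samples: (wall_clock_seconds, uptime_seconds) pairs in time order.

    Returns the wall-clock time of every sample whose uptime went backwards,
    which means the system restarted since the previous sample.
    """
    reboots = []
    for (_t0, up0), (t1, up1) in zip(samples, samples[1:]):
        if up1 < up0:
            reboots.append(t1)
    return reboots

# Hypothetical one-minute samples; the uptime resets before the third one.
history = [(0, 1000), (60, 1060), (120, 15), (180, 75)]
```

The timestamps returned tell you which window to check for scheduled tasks that may not have run.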
Disk Activity
Disk activity is the time taken for a disk drive to actively process requests. There are several key metrics that must be monitored:
- Disk busy time – measures the percentage of time the disk is active. If this value is high, requests to access the disk are piling up.
- Input/output operations per second (IOPS) – indicates the workload the disk drive is handling.
- Disk read/write – measures the time taken to read/write blocks of data from the disk. A lower value means better performance.
- Disk queue length – measures the number of requests waiting in the queue to be serviced. For best performance, the disk queue length should be minimal.
Take note that monitoring disk performance is crucial for tasks that are heavily I/O-intensive.
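Busy time and IOPS are typically computed from the difference between two readings of cumulative disk counters. The sketch below assumes counters similar to those Linux exposes in `/proc/diskstats` (completed reads, completed writes, and milliseconds spent doing I/O); the `DiskSample` type and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DiskSample:
    timestamp: float  # seconds
    reads: int        # completed read operations (cumulative)
    writes: int       # completed write operations (cumulative)
    busy_ms: int      # milliseconds spent doing I/O (cumulative)

def disk_metrics(prev, curr):
    """Return IOPS and busy percentage for the interval between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    ops = (curr.reads - prev.reads) + (curr.writes - prev.writes)
    busy_ms = curr.busy_ms - prev.busy_ms
    return {
        "iops": ops / elapsed,
        "busy_percent": 100.0 * busy_ms / (elapsed * 1000.0),
    }

# Hypothetical samples taken ten seconds apart.
first = DiskSample(timestamp=0.0, reads=1000, writes=500, busy_ms=2000)
second = DiskSample(timestamp=10.0, reads=1600, writes=900, busy_ms=7000)
```

A busy percentage that stays close to 100 together with a growing queue is a strong hint that the disk is the bottleneck.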
Page File Usage
Unused or unaccessed data is stored in the page file. Operations that exceed the limited random-access memory (RAM) available to the operating system (OS) are also sent to the page file to be stored.
If page file usage is consistently high, the system’s paging file is not sufficient to cater to your server’s needs, and the server is likely short on physical memory.
Another important metric is page swapping. Whenever your server is running out of working memory, an area of disk space is used to temporarily save data so as to free up more memory. We do not recommend relying on page swapping. Typically, it means that you haven’t provisioned enough memory to run your server.
Remember, page swapping is a short-term resolution to memory capacity exhaustion. Since page swapping slows down response times, it should be avoided.
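To see whether a server is actively swapping, you can compare two readings of the swap counters. This sketch assumes a Linux server, where `/proc/vmstat` reports cumulative `pswpin` and `pswpout` page counts; the function names and sample text are illustrative.

```python
def swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout page counts from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters

def swap_rate(before_text, after_text, interval_seconds):
    """Pages swapped in and out per second between two readings."""
    before = swap_counters(before_text)
    after = swap_counters(after_text)
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in ("pswpin", "pswpout")
    }

# Made-up readings taken ten seconds apart.
VMSTAT_BEFORE = "nr_free_pages 5000\npswpin 100\npswpout 200\n"
VMSTAT_AFTER = "nr_free_pages 4000\npswpin 400\npswpout 800\n"
```

A sustained non-zero rate, rather than an occasional burst, is the pattern that usually points to insufficient memory.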
Context Switches
Context switching is an intensive process. It occurs when the kernel (the computer program at the core of a computer’s OS) switches the processor from one process or thread to another. CPU resources are used each time a context switch happens, so when context switching is extensive, more and more valuable CPU resources are taken up.
Excessive context switching is caused by running multiple busy processes or by application bugs that increase the number of context switches. A sudden increase in context switching on a server can indicate a problem. Therefore, monitoring context switches is essential for your server’s performance.
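A sudden increase can be detected by comparing the current context switch rate with recent history. The sketch below assumes a Linux server, where `/proc/stat` contains a cumulative `ctxt` counter; the threefold threshold and the function names are arbitrary choices for illustration.

```python
def context_switches(stat_text):
    """Return the cumulative context switch count from /proc/stat-style text."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")

def is_spike(rate, recent_rates, factor=3.0):
    """True when the current rate exceeds factor times the recent average."""
    if not recent_rates:
        return False
    average = sum(recent_rates) / len(recent_rates)
    return rate > factor * average

# Made-up excerpt in the layout of /proc/stat.
STAT_SAMPLE = "cpu  1 2 3 4\nctxt 123456\nbtime 1700000000\n"
```

To get a rate, read the counter twice and divide the difference by the number of seconds between the readings.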
Time Synchronization
Systems on the same network that share files or communicate with one another have time-bound activities. So, imagine what happens if the system clocks are not synchronized. The results could be disastrous.
Inaccurate clocks could cause data to be overwritten or create version conflicts. In the worst case, they can cause programs to function incorrectly. Always monitor system clock offsets against a reference clock.
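The offset against a reference clock is commonly estimated from four timestamps exchanged in a single request and response, which is the approach NTP uses. The following is a sketch of that arithmetic only, with hypothetical function names; it does not perform any network communication.

```python
def clock_offset(t0, t1, t2, t3):
    """Estimate the offset of the reference clock relative to the local clock.

    t0: request sent (local clock)      t1: request received (reference clock)
    t2: response sent (reference clock) t3: response received (local clock)
    A positive result means the local clock is behind the reference.
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0

def round_trip_delay(t0, t1, t2, t3):
    """Network round-trip time, excluding the reference server's processing time."""
    return (t3 - t0) - (t2 - t1)

# Hypothetical exchange: the local clock runs 5 s behind, each network leg takes 0.1 s.
sample = (100.0, 105.1, 105.2, 100.3)
```

The estimate assumes the outbound and return network delays are roughly equal; asymmetric paths introduce an error of up to half the round-trip delay.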
Handle Usage
Handles refer to the resources an application makes reference to. The applications running on your server request and receive resources, use them, and then return them to the OS. At times, due to a program error, the application ‘forgets’ to return the handle after use. This is a handle leak.
Remember that resources on a server are finite. Repeated handle leaks may ‘exhaust’ the server over time, causing the server’s performance to degrade. Monitor handle usage closely over time. If the number of open handles increases drastically or consistently, this could imply a handle leak.
You’ll need to investigate and identify the culprits. You can either terminate such processes or patch the programs.
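A consistent increase can be checked with a simple heuristic over a series of handle counts for one process. This is a rough sketch under assumed thresholds; the function name, the minimum growth of 100 handles, and the 90% rising fraction are all illustrative values you would tune for your own workload.

```python
def looks_like_handle_leak(counts, min_growth=100, min_rising_fraction=0.9):
    """Heuristic: the open handle count rose in almost every interval and grew substantially."""
    if len(counts) < 3:
        return False  # too few samples to call it a trend
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    rising = sum(1 for delta in deltas if delta > 0)
    grew_enough = counts[-1] - counts[0] >= min_growth
    mostly_rising = rising / len(deltas) >= min_rising_fraction
    return grew_enough and mostly_rising

# Hypothetical hourly handle counts for two processes.
steady_growth = [500, 540, 585, 630, 690, 750]
normal_churn = [500, 520, 490, 510, 495, 505]
```

The first series would be flagged for investigation, while the second fluctuates around a stable level and would not.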
Process Activity
There can be instances when an application creates new processes without stopping previously started processes. Handling and multi-tasking across these processes can burden your server.
As a result, your server performance will suffer drastically. Ensure applications run correctly and exit properly. To do so, you need to track and monitor all process activities on your server.
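One way to spot an application that keeps spawning processes is to count running instances per program name. The sketch below assumes you already have a process listing as `(pid, name)` pairs from whatever tool you use; the function name and the limit of 20 instances are hypothetical.

```python
from collections import Counter

def runaway_candidates(processes, max_instances=20):
    """processes: iterable of (pid, name) pairs.

    Returns {name: count} for every program with more than max_instances running.
    """
    counts = Counter(name for _pid, name in processes)
    return {name: n for name, n in counts.items() if n > max_instances}

# Made-up listing: 25 copies of "worker" and 2 of "nginx".
listing = [(pid, "worker") for pid in range(1000, 1025)]
listing += [(2000, "nginx"), (2001, "nginx")]
```

Some services legitimately run many workers, so a per-application limit is usually more useful than a single global one.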
Network Traffic
Network activity monitoring is crucial to measuring your server’s performance. Each network interface provides an indication of the network activity load. If the bandwidth usage is nearing the maximum speed of the network interface, this could indicate a possible bottleneck.
By constantly monitoring input and output (I/O) activities on the network card, you can spot possible hardware failure or overloading. You can also plan the hardware requirements to ensure optimal server performance.
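Bandwidth usage relative to the interface speed can be computed from two readings of a byte counter. This is a minimal sketch, assuming you know the link speed in megabits per second and can read a cumulative byte count for the interface; the function name and the sample figures are illustrative.

```python
def link_utilization(bytes_before, bytes_after, interval_seconds, link_speed_mbps):
    """Percentage of the link capacity used in one direction during the interval."""
    if interval_seconds <= 0 or link_speed_mbps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_transferred = (bytes_after - bytes_before) * 8
    capacity_bits = interval_seconds * link_speed_mbps * 1_000_000
    return 100.0 * bits_transferred / capacity_bits

# Hypothetical: 1,125,000,000 bytes sent in 10 seconds on a 1000 Mbps interface.
usage = link_utilization(0, 1_125_000_000, 10, 1000)
```

Received and transmitted traffic should be measured separately, since a full-duplex link can be saturated in one direction while idle in the other.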
TCP Activity
Most of your applications are connection-oriented and use TCP as the transport protocol. HTTP, SMTP, and SQL database connections all run over TCP underneath. If the TCP layer performance drops, so does the performance of your application.
There are several important metrics that help with monitoring TCP:
- The connection rate to and from the server helps indicate the server workload.
- The number of connection drops on the server. A high number could indicate a problem.
- % of retransmissions – retransmissions occur when the server does not receive an acknowledgment from the client. Upon timeout, the server has to send out the transmission again. To ensure good TCP performance, keep retransmissions at a minimum. Bear in mind that repeated retransmissions can result in a severe reduction in throughput.
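The retransmission percentage can be derived from two readings of the TCP segment counters. The sketch below assumes counters named `OutSegs` and `RetransSegs`, as found in the TCP statistics that Linux publishes in `/proc/net/snmp`; the function name and sample numbers are illustrative.

```python
def retransmission_percent(prev, curr):
    """Retransmitted segments as a percentage of segments sent between two readings."""
    sent = curr["OutSegs"] - prev["OutSegs"]
    retransmitted = curr["RetransSegs"] - prev["RetransSegs"]
    if sent <= 0:
        return 0.0  # nothing sent in the interval, so no meaningful ratio
    return 100.0 * retransmitted / sent

# Hypothetical cumulative counters from two consecutive readings.
earlier = {"OutSegs": 10_000, "RetransSegs": 50}
later = {"OutSegs": 20_000, "RetransSegs": 250}
```

Tracking the percentage over time is more informative than a single reading, because a brief burst of retransmissions is common while a steady climb suggests congestion or packet loss.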
OS Log Files
OS logs are probably the most common means of monitoring the health of your server, as they contain error details, crash reports, and other abnormalities that help you troubleshoot issues.
While Windows offers System, Security, and Application log files, Unix has system log and cron log files stored in the /var/log directory. Regular monitoring and analysis of log events, with alerting on top, can help notify you of any server abnormalities.
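A basic form of log analysis is to count lines that mention a problem keyword and alert when the count rises. The following sketch uses a made-up keyword list and made-up syslog-style lines; real log formats vary, so treat the pattern as a starting point rather than a complete rule set.

```python
import re
from collections import Counter

# Assumed keyword list; extend it to match the messages your systems actually emit.
PROBLEM = re.compile(r"\b(error|critical|fail(?:ed|ure)?|panic)\b", re.IGNORECASE)

def count_problem_lines(lines):
    """Count log lines by the first problem keyword they contain."""
    counts = Counter()
    for line in lines:
        match = PROBLEM.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts

# Invented sample lines in a syslog-like layout.
sample_log = [
    "Jan 10 03:12:05 web01 sshd[88]: Failed password for root from 203.0.113.7",
    "Jan 10 03:13:00 web01 cron[120]: (root) CMD (backup.sh)",
    "Jan 10 03:14:22 web01 app[77]: ERROR database connection lost",
]
summary = count_problem_lines(sample_log)
```

Counting per keyword makes it easy to alert on a jump in one category, such as repeated failed logins, without being drowned out by routine entries.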