Linux super-duper admin tools: health-check

Updated: July 17, 2019

Good system troubleshooting tools are everything. Great tools, though, are harder to find. Luckily, Linux comes with a wealth of excellent programs and utilities that let you profile, analyze and resolve system behavior problems, from application bottlenecks to misconfigurations and even bugs. It all starts with a tool that can grab the necessary metrics and give you the data you need.

Health-check is a neat program that can monitor and profile processes, so you can identify and resolve excess resource usage - or associated problems. Where it stands out from the rest of the crowd is that it aims to offer many useful facets of system data simultaneously, so you can more easily examine your systems component by component, troubleshoot performance issues and fix configuration mishaps in your environment. Rather than having to run five tools at the same time, or do five runs to get all the info you need, you just use health-check, and Bob's your distant relative. Good. All right, ready? Proceed.

Health-check in action

Before we run the utility in anger, a few small notes. One, you need sudo privileges to run this tool, although you can run the actual application in the context of other users on the system (with the -u flag). Two, you need some understanding of how Linux works to make use of the results - I have a whole bunch of articles on this topic, linked farther below.

In essence, as I've outlined just a moment ago, health-check brings functionality from a variety of programs under one umbrella. It nicely blends elements that you would get if you ran netstat, lsof, vmstat and iostat, and examined various structs under /proc and /sys. This is somewhat like dstat, which combines the power of vmstat, iostat and ifstat. You can start with a simple run in brief mode (-b flag):

sudo ./health-check -u "user" -b "binary"

There's going to be a lot of output, even in this "brief" mode, something like:

CPU usage (in terms of 1 CPU):
 User:  34.24%, System:  13.30%, Total:  47.54% (high load)

First, you get basic CPU figures, normalized per core (100% = 1 full core). Health-check has internal thresholds by which it will indicate whether this is low, moderate or high load. This is just to give you a sense of what you should expect. The specifics will depend on the type of application and workload you're profiling. There will be differences between GUI and command-line tools, between software that reads from a database and software that does not, between programs with many shared libraries and those with few, in the use of hardware, etc.

Page Faults:
   PID Process                 Minor/sec    Major/sec    Total/sec
  1043 google-chrome            16156.46         0.25     16156.71

We talked about page faults at length in the past (links in the more reading section below). If you don't know what your application is supposed to be doing, the numbers won't tell you much on their own. But they can be very useful for comparative studies, like two different programs of the same type, or two different versions of the same program, or the same program running on two different platforms.

Context Switches:
 13687.53 context switches/sec (very high)

The context switch value indicates how often the kernel takes one task off the processor and switches to another. For interactive processes (like the browser), which have a user-interactive component, you do actually want as many context switches as possible (the tasks run as little as possible), because you don't want these tasks hogging the processor. In fact, long computation is a sign of batch jobs. Here, having few context switches could be an indication of an issue with an interactive (GUI) application like the browser.

File I/O operations:
  I/O Operations per second: 312.80 open, 283.49 close, 768.71 read,
                             410.83 write

The I/O values are useful if you have a baseline, and they also depend on the underlying I/O stack, including the hardware, the bus, the driver, the filesystem choice, and any other disk operations running at the same time.

Polling system call analysis:
 google-chrome (1043), poll:
       1555 immediate timed out calls with zero timeout (non-blocking peeks)
          1 repeated timed out polled calls with non-zero timeouts
            (light polling)
       1125 repeated immediate timed out polled calls with zero timeouts
            (heavy polling peeks)

This section is another indicator of possible user-interactiveness of the profiled binary. Polling system calls are system calls that wait for file descriptors to become ready to perform I/O operations. Typically, this will indicate network connections. We will examine this in more detail when we do a full run.

Memory:
Change in memory (K/second):
   PID Process         Type        Size       RSS       PSS
  1043 google-chrome   Stack      32.51     27.59     27.59 (growing
                                                             moderately fast)
  1043 google-chrome   Heap    67550.94   9092.51   9112.70 (growing very
                                                             fast)
  1043 google-chrome   Mapped 102764.33  27296.24  18781.80 (growing very
                                                             fast)

For most people and most application types, memory operations will not be that interesting. Memory-intensive workloads are not that common in desktop software. They can be quite important for databases and complex computations, something you'd normally do on a server-class system. But this set of numbers can be used to examine differences between platforms, kernels, and software versions.
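If you want to cross-check the RSS and PSS columns by hand, the kernel lists both values per mapping in /proc/<pid>/smaps. Here's a minimal sketch that adds them up. The function name mem_totals is made up for this example, and the result is a point-in-time total in kB, not the K/second rate of change that health-check prints.

```shell
# mem_totals FILE
# FILE is a /proc/<pid>/smaps file (or a saved copy of one).
# Hypothetical helper, not part of health-check.
mem_totals() {
    awk '
        $1 == "Rss:" { rss += $2 }   # resident set size, per mapping, in kB
        $1 == "Pss:" { pss += $2 }   # proportional set size, per mapping, in kB
        END { printf "rss_kb=%d pss_kb=%d\n", rss, pss }
    ' "$1"
}

# Example: totals for the current shell, if smaps is readable.
if [ -r "/proc/$$/smaps" ]; then
    mem_totals "/proc/$$/smaps"
fi
```

Run it twice a few seconds apart and the difference gives you a rough growth figure. Reading the smaps file of a process owned by another user will normally require sudo.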

Open Network Connections:
   PID Process         Proto       Send   Receive  Address
  1043 google-chrome   UNIX    531.52 K   35.16 K  /run/user/1000/bus
  1043 google-chrome   UNIX      0.00 B   88.56 K  /run/systemd/journal/...
  1043 google-chrome            64.51 K    0.00 B  socket:[2737924]
  1043 google-chrome            30.55 K    0.00 B  socket:[2746558]
  1043 google-chrome             3.98 K    0.00 B  socket:[2742865]
...

The network connections set of numbers gives you results that are similar to what netstat and lsof do, but you also get the send/receive values, which can be quite useful. If you know what the program is meant to be doing network-wise, you can profile its execution, and look for possible misconfigurations in the network stack.

Longer (detailed) run

You can also opt for more statistics (for instance, with -c -f flags, no -b flag). You will get extended results for each of the sections we've discussed earlier, and this can give you additional insight into how your software is behaving. If you're tracing child processes and forks, then you can see the sequence of execution. CPU statistics will be listed based on usage, with the highest offenders at the top.

CPU usage (in terms of 1 CPU):
   PID Process           USR%   SYS% TOTAL%   Duration
  1715 vlc              47.04   8.17  55.21      14.71  (high load)
  1720 vlc              46.91   7.96  54.87      14.43  (high load)
  1723 vlc              46.77   7.96  54.74      14.35  (high load)
...
  1721 vlc               1.69   1.08   2.77       1.21  (light load)
  1722 vlc               0.20   0.07   0.27       0.16  (very light load)
  1726 vlc               0.07   0.00   0.07       0.02  (very light load)
  1742 vlc               0.00   0.00   0.00       0.07  (idle)
  1732 vlc               0.00   0.00   0.00       0.02  (idle)
  1728 vlc               0.00   0.00   0.00       0.06  (idle)
  1719 vlc               0.00   0.00   0.00       0.06  (idle)
 Total                 971.80 161.72 1133.52            (CPU fully loaded)

In the example above, running VLC (with an HD clip playback of about 14 seconds), we utilized 1,133% of CPU time, which translates into 11.33 CPU cores. That sounds like a lot, but since the system has eight cores (threads), this effectively means only about 1.4 cores (11.33 / 8) were actually used for the video. It would also be interesting to actually know which cores were used.
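The arithmetic above is easy to redo for your own runs. The sketch below is just that, a sketch: cpu_cores and cpu_share are names I made up, and dividing by the number of logical CPUs follows the reasoning in the text rather than anything health-check reports itself.

```shell
# Hypothetical helpers, not part of health-check.

# cpu_cores PCT: convert a TOTAL% figure (100% = one full core) into cores.
cpu_cores() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", pct / 100 }'
}

# cpu_share PCT NCPU: spread that figure across NCPU logical CPUs.
cpu_share() {
    awk -v pct="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", pct / 100 / ncpu }'
}

cpu_cores 1133.52      # the Total line from the VLC run
cpu_share 1133.52 8    # eight logical CPUs, as on the test system
```

On your own box, you can replace the hard-coded 8 with "$(nproc)", which prints the number of processing units available.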

Context Switches:
   PID Process           Voluntary   Involuntary     Total
                       Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  1744 vlc                 2500.09         1.15      2501.24 (high)
  1723 vlc                 1493.47         1.82      1495.29 (high)
  1740 vlc                 1224.03         3.31      1227.33 (high)
  1717 vlc                  947.43         0.40       947.84 (quite high)
  1731 vlc                  736.37         0.81       737.18 (quite high)

For page faults, there isn't anything new. With context switches, we also get a list of voluntary and involuntary CS. The latter bunch can be an indication of tasks exceeding their allocated slice, which would then result in them having a lower dynamic priority the next time they run (which is not good for interactive processes).
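If you want to look at the same split without health-check, the kernel keeps cumulative counters in /proc/<pid>/status, in the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields. A minimal sketch follows. The function name ctxt_switches is made up, and the numbers are totals since the task started, not the per-second rates shown above.

```shell
# ctxt_switches FILE
# FILE is a /proc/<pid>/status file (or a saved copy of one).
# Hypothetical helper, not part of health-check.
ctxt_switches() {
    awk '
        $1 == "voluntary_ctxt_switches:"    { vol = $2 }
        $1 == "nonvoluntary_ctxt_switches:" { invol = $2 }
        END { printf "voluntary=%d involuntary=%d\n", vol, invol }
    ' "$1"
}

# Example: counters for the current shell.
if [ -r "/proc/$$/status" ]; then
    ctxt_switches "/proc/$$/status"
fi
```

Sample the file twice, subtract, divide by the interval, and you have a rough rate you can hold against the health-check figures.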

File I/O operations:
   PID Process   Count  Op  Filename
  1715 vlc         176    R /home/roger/developers.webm
  1715 vlc          48    C /etc/ld.so.cache
  1715 vlc          48    O /etc/ld.so.cache
  1715 vlc          34    R /usr/share/X11/locale/locale.alias

The File I/O section now also shows the number of operations per process, the type of operation, and the filename. This does not have to be an actual file on the disk; it could also be a bus. The available operations are printed at the bottom of this section. How exactly the read and write operations are done depends on many factors.

...
  1715 vlc           1    C /lib/x86_64-linux-gnu/libnss_systemd.so.2
  1715 vlc           1   OR /usr/bin/vlc
 Total            4352
 Op: O=Open, R=Read, W=Write, C=Close
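With thousands of operations in the list, a quick per-operation summary can help. The sketch below adds up the Count column for each Op value. It assumes the five-column layout shown above and process names without spaces, and io_by_op is a name I made up for this example.

```shell
# io_by_op: read a "File I/O operations" table on standard input and
# print the summed Count for each Op value, sorted by Op.
# Hypothetical helper, not part of health-check.
io_by_op() {
    awk '
        # Data rows start with a numeric PID and have a numeric Count;
        # this skips the header, the Total line and the Op legend.
        $1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { sum[$4] += $3 }
        END { for (op in sum) printf "%s %d\n", op, sum[op] }
    ' | sort
}

# Example with the first rows of the VLC run.
io_by_op <<'EOF'
   PID Process   Count  Op  Filename
  1715 vlc         176    R /home/roger/developers.webm
  1715 vlc          48    C /etc/ld.so.cache
  1715 vlc          48    O /etc/ld.so.cache
  1715 vlc          34    R /usr/share/X11/locale/locale.alias
EOF
```

Combined operations like OR are counted under their own key, since the sketch doesn't try to split them.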

You also get the frequency of these I/O operations:

File I/O Operations per second:
   PID Process                 Open   Close    Read   Write
  1715 vlc                   100.57   96.72   89.77    1.75
  1719 vlc                     3.24    4.99    3.04    0.00
...

The next section is all about system calls, and it's very detailed. The output is similar to strace. You will have the process ID, the process name, the system call, the count, rate, total time (in us), and the percentage each system call took out of the total execution time. You cannot interpret these numbers unless you know what the application is meant to be doing, or you can compare to a baseline.

System calls traced:
   PID Process     Syscall     Count    Rate/Sec    Total uSecs  % Call Time
  1715 vlc         stat        429      28.9555          3004      0.0011
  1715 vlc         mmap        240      16.1989          4912      0.0017
  1715 vlc         mprotect    203      13.7015          4739      0.0017
...

The information is more useful when you look at polling system calls. For brevity, I've slightly edited the output below. The last four fields all indicate timeouts, e.g.: Zero Timeout, Minimum Timeout, etc. Essentially, they give you an indicator of how much time it took for these system calls to finish. The Infinite count field refers to system calls that had infinite wait (for the duration of the application run). The information is also shown as a histogram per process, from zero to infinite, bucketed logarithmically: up to 10 us, 10-99 us, 100-999 us, and so forth.
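To make the logarithmic bucketing concrete, here is a small sketch that places a timeout (in microseconds) into its decade bucket. This is my own illustration of the idea, not code from health-check: the function name poll_bucket is made up, the labels won't match the tool's output exactly, and I'm assuming the first bucket covers anything below 10 us.

```shell
# poll_bucket USEC: print the decade bucket a timeout falls into.
# Hypothetical helper, not part of health-check.
poll_bucket() {
    awk -v us="$1" 'BEGIN {
        if (us < 10) { print "up to 10 us"; exit }
        lo = 10
        # Climb one decade at a time until the value fits.
        while (us >= lo * 10) lo *= 10
        printf "%d-%d us\n", lo, lo * 10 - 1
    }'
}

poll_bucket 5
poll_bucket 42
poll_bucket 250
```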

Top polling system calls:
   PID Process  Syscall     Rate/Sec  Inf  Zero    Min     Max      Avg
  1715 vlc      poll          3.2398   45     1  0.0 s  25.0 s    1.0 s
  1715 vlc      rt_sigtimed   0.1350    2     0  0.0 s   0.0 s    0.0 s
  1717 vlc      poll        124.7312    5     1  0.0 s  30.0 s  958.4 ms
...

The more detailed output log will also have the filesystem sync data.

Filesystem Syncs:
   PID  fdatasync    fsync     sync   syncfs    total   total (Rate)
  1723          0        2        0        0        2     0.13

Files Sync'd:
   PID  syscall    # sync's filename
  1723  fdatasync         1 /home/roger/.../vlc-qt-interface.conf.lock
  1723  fdatasync         1 /home/roger/.../vlc-qt-interface.conf.XM1715

Lastly, the detailed output will also have memory and network connection information, but the main difference will be the results shown per process. As we've discussed earlier, the former will not typically be useful for most desktop workloads (unless you're the developer of the program), while the latter can be useful in finding issues in the network stack.

Health-check creates a pretty large set of results, but it does provide a lot of insight into how your applications behave. You can combine its usage with other software to get a full analysis of your software and troubleshoot any performance issues. Health-check can also profile running tasks (-p flag), which makes it quite handy as an addition to your problem solving toolbox.

Manual setup

If you're not happy with the version available in the repos, you can manually compile. Another reason to do this is to work around any possible problems that older versions may have, like for instance the timer_stats error, whereby the tool tries to access /proc/timer_stats, but this struct is no longer exposed in recent kernels:

Cannot open /proc/timer_stats.

Indeed, if you check, you get:

cat /proc/timer_stats
cat: /proc/timer_stats: No such file or directory
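If you need to run the same check across several machines, it can be wrapped in a tiny function. This is only a sketch: has_timer_stats is a made-up name, and the optional path argument is there purely so the function can be tried against other files.

```shell
# has_timer_stats [PATH]: succeed if the timer_stats file exists.
# PATH defaults to /proc/timer_stats. Hypothetical helper.
has_timer_stats() {
    [ -e "${1:-/proc/timer_stats}" ]
}

if has_timer_stats; then
    echo "/proc/timer_stats is present"
else
    echo "/proc/timer_stats is missing, an older health-check may complain"
fi
```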

To compile, run:

git clone https://kernel.ubuntu.com/git/cking/health-check.git/
cd health-check
make

You may see the following error:

json.h:25:10: fatal error: json-c/json.h: No such file or directory
#include <json-c/json.h>

This means you're missing the development package for JSON, which the tool needs to successfully compile. The actual name of the package will vary from one distribution to another, but in my test on Kubuntu, the following resolved the compilation error:

sudo apt-get install libjson-c-dev

More reading

If you're interested in additional tools on system troubleshooting, then:

Linux super-duper admin tools: strace

Linux super-duper admin tools: lsof

Linux super-duper admin tools: gdb

Slow system? Perf to the rescue!

Linux system debugging super tutorial

Linux cool hacks - parts one through four - linking just the last one.

Last but not the least, my problem solving book!

Conclusion

Health-check is a very useful, practical tool. It does not replace strace or netstat or perf, but it can sure help you get a very accurate multi-dimensional snapshot of whatever you're profiling. This is a very good first step that can point you in the right direction. You can then select a utility that specifically examines the relevant facet of the software run (maybe Wireshark for network or Valgrind for memory). In a way, this makes health-check into a Jack o' All Trades.

You do need some understanding of how Linux systems work - and the application you're running. But even if you don't have that knowledge, health-check can be used for comparative studies and troubleshooting of performance bottlenecks. If you know something isn't running quite as well as it should, you can trace it once on a good system, once on a bad (affected) system, and then compare the two. The many types of data that health-check provides will greatly assist in solving the issue. And that brings us to the end of this tutorial. With some luck, you have learned something new, and it was an enjoyable ride, too. Take care.
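Here's what such a good-versus-bad comparison could look like for a single metric, the context switch rate from the brief report. Treat it as a sketch under a few assumptions: the two reports were saved to files beforehand, the line format matches the brief output shown earlier, and ctx_rate and compare_ctx are names I made up.

```shell
# Hypothetical helpers, not part of health-check.

# ctx_rate FILE: print the context switch rate from a saved brief report.
ctx_rate() {
    awk '/context switches\/sec/ { print $1; exit }' "$1"
}

# compare_ctx GOOD BAD: print both rates and the bad/good ratio.
compare_ctx() {
    awk -v good="$(ctx_rate "$1")" -v bad="$(ctx_rate "$2")" 'BEGIN {
        # Guard against a missing or zero baseline before dividing.
        if (good + 0 == 0) { print "no baseline rate found"; exit 1 }
        printf "good=%.2f bad=%.2f ratio=%.2f\n", good, bad, bad / good
    }'
}
```

Assuming the report goes to standard output, you could capture it on each system with something like sudo ./health-check -u "user" -b "binary" > good.log, do the same into bad.log on the affected box, and then run compare_ctx good.log bad.log. The same pattern works for any other single-line figure in the report.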

Cheers.
