Monitoring processes and resources

General purpose diagnostic programs (memory, CPU, I/O)

`top`/`htop`

We can use ps to list the currently running processes but it does not provide much information about the resource metrics or how the process changes over time. We can use top to get more information.

top provides an interactive interface for the information that ps displays. It updates in real time and shows the most active processes based on the CPU time that they are utilising. You can also order by memory usage.

Here I have pressed u to show only the processes associated with my user:

Main commands

Command	Action
-u	Show processes by selected user
M	Sort by memory usage
P	Sort by cumulative CPU usage
?	View key and explanation

Understanding the categories

Main/IO
- The first covers all processes. The second focuses on input/output processes (i.e. reading and writing to disks and other devices)
PRI
- This stands for priority. This metric reflects the kernel’s current schedule priority for the process. The higher the value, it is less likely that the kernel will schedule the process if there are competing processes that require CPU time. The lower the value, the greater priority this process has over others.
NI
- This stands for nice value. This metric exists in order to allow administrators to nudge or influence the priority of a given process. You cannot directly tell the kernel to do x now instead of y but you can make what are effectively suggestions by manipulating the nice value.
- The kernel adds the nice value to the current priority value for the given process to determine its next time slot. When you increase the nice value of process P you are being “nicer” to the other processes by influencing the priority of P downwards so that the other processes receive greater precedence from the kernel.
- By default, the nice value will be 0. To reduce priority of PID 1234, you would use:
```
$ renice 20 1234
```
VIRT
- The total amount of Virtual_memory_and_the_MMU_in_Linux used by the process including: program code, data, shared libraries, pages that have been swapped, pages that have been mapped but not used.
RES
- Stands for resident size
- The non swapped physical memory the process has used
SHR
- The size of the process’s shared pages
S
- Status:
  - S for sleeping (idle)
  - R for running
  - D for disk sleep

`vmstat`

vmstat provides similar metrics to htop but tells you more about the memory state and the activities of the kernel in a single row.

The default output is a single line with the averages since boot. You can add a delay parameter (in secs) which will then output at that interval, allowing you to see memory usage in realtime, e.g:

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 4326768 334228 5050952    0    0     8    19   80   10  4  1 94  0  0
 0  0      0 4365520 334260 5054468    0    0     0   125 2140 3434  4  1 94  0  0
 1  0      0 4382400 334276 5068940    0    0     0    77 2102 3988  3  1 95  0  0
 1  0      0 4434000 334288 5052908    0    0     0    25 2859 4278  6  1 92  0  0
 0  0      0 4391576 334304 5086484    0    0     0   110 2899 6480  8  3 90  0  0

procs
- The number of runnable processes (r) and the number of blocked (b) processes
memory
- The core memory output distinguishing:
  - Total kbs swapped to disk
  - Total kbs free
  - Total kbs currently in buffer and not written
  - Total amount of virtual memory in the cache
swap
- Distinguishes amount of memory swapped in (si) to memory and swapped out (so) to disk
io
- Disk actions
- Amount of data read from harddisk (bi)
- Amount of data written to harddisk (bo)
system
- The number of times the kernel switches to kernel code
cpu
- Percentage of the different CPU behaviours:
  - Responding to user tasks (us)
  - Time that it is idle (id)

Files being used by active processes: `lsof`

lsof stands for list open files. It lists opened files and the processes using them. Without modifiers it outputs a huge amount of data. The best way to use it is to execute it against a specific PID. For example the below output gives me some useful info about which files VS Code is using:

System calls: `strace`

A system call is when a process requests a service from the kernel, for instance an I/O operation to memory. We can trace these system calls with strace.

CPU performance

We can use the uptime program to assess overall CPU performance in the form of a load average.

Load average is the number of active processes currently ready to run. It is an estimate of the number of processes that are capable of using the CPU at any given time.

Uptime gives you three load averages:

$ uptime
11:19:16 up 14 days,  3:53,  1 user,  load average: 0.84, 0.57, 0.50

The three numbers are load averages for the past 1 minute, 5 minutes and 15 minutes respectively.
A load average close to 0 is usually a good sign because it means that your processor isn’t being challenged and you are conserving power. Anything equal to or above 1 means that a single process is using the CPU nearly all the time. You can identify that process with htop and it will obviously be near to the top. (This is often caused by Chrome and Electron-based software.)

Memory status

We know that processes primarily interact with virtual memory in the form of pages which are then translated to physical blocks by the kernel via the Virtual_memory_and_the_MMU_in_Linux. There are several tools which provide windows onto this process.

System page size

We can view the overall system page size which is a representation of the amount of virtual memory available:

$ getconf PAGE_SIZE
4096

This will typically be the same for all Linux systems.

`free` : available physical memory

free displays the total amount of free and¬used physical and swap memory in the system, as well as the buffers and caches used by the kernel.

$ free
              total        used        free      shared  buff/cache   available
Mem:        16099420     5931512     5039344     2046460     5128564     7781904
Swap:        3145724           0     3145724