How to Monitor and Manage GPUs With Nvidia-smi

The nvidia-smi (NVIDIA System Management Interface) command-line utility is the gateway to understanding and managing the GPUs in a GPU server. In this tutorial, we’ll explore using nvidia-smi to display detailed information about NVIDIA GPUs, troubleshoot common issues, and even dive into some advanced features to get the most out of this utility.

What Is Nvidia-smi Used For?

Nvidia-smi provides a treasure trove of information ranging from GPU specifications and usage to temperature readings and power management. Let’s explore some of its use cases and highlight its importance in the realm of GPU management.

1. Monitoring GPU Performance

At the forefront of its capabilities, nvidia-smi excels in real-time monitoring of GPU performance. This includes tracking GPU utilization, which tells us how much of the GPU’s computational power the system is currently using.

Also, it monitors memory usage, an essential metric for understanding how much of the GPU’s Video RAM (VRAM) applications are occupying, which is crucial in workload management and optimization.

Moreover, nvidia-smi provides real-time temperature readings, ensuring that the GPU operates within safe thermal limits. This aspect is especially important in scenarios involving continuous, intensive GPU usage, as it helps in preventing thermal throttling and maintaining optimal performance.
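As a quick sketch of what this looks like in practice, the query interface (covered in more detail below) can report utilization, memory usage, and temperature together; the field names are standard --query-gpu properties, and the 2-second refresh interval is just an example:

$ nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,temperature.gpu --format=csv -l 2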

2. GPU Hardware Configuration

Nvidia-smi isn’t just about monitoring, as it also plays a pivotal role in hardware configuration. It allows us to query various GPU attributes, such as clock speeds, power consumption, and supported features. This information is vital if we’re looking to optimize our systems for specific tasks, whether it’s for maximizing performance in computationally intensive workloads or ensuring energy efficiency in long-running tasks.

Furthermore, nvidia-smi provides the capability to adjust certain settings like power limits and fan speeds, offering a degree of control to us if we want to fine-tune our hardware for specific requirements or environmental conditions.
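For example, the output of nvidia-smi -q can be narrowed to specific display sections; CLOCK and POWER are two of the standard sections, so a command along these lines shows the current clock speeds and power readings in one place:

$ nvidia-smi -q -d CLOCK,POWER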

3. GPU Troubleshooting

When troubleshooting GPU issues, nvidia-smi is an invaluable asset. It offers detailed insights into the GPU’s status, which is critical in diagnosing these issues.

For instance, if a GPU is underperforming, nvidia-smi can help us identify whether the issue is related to overheating, excessive memory usage, or a bottleneck in GPU utilization. This tool also helps in identifying failing hardware components by reporting errors and irregularities in GPU performance.
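As one possible starting point for such a diagnosis, the TEMPERATURE and PERFORMANCE display sections summarize the thermal readings and the reasons, if any, why the clocks are currently being throttled:

$ nvidia-smi -q -d TEMPERATURE,PERFORMANCE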

Exploring Nvidia-smi and Its Options

-L or --list-gpus Option

This option lists all GPUs in the system:

$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2060 SUPER (UUID: GPU-fb087aea-1cd3-0524-4f53-1e58a5da7a3c)

It’s particularly useful for quickly identifying the GPUs present, especially in systems with multiple GPUs.
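Once we know a GPU’s index or UUID from this list, we can target just that device with the -i option; for example, to show only GPU 0:

$ nvidia-smi -i 0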

--query-gpu Option

1. Query the VBIOS version of each device:

$ nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
name, pci.bus_id, vbios_version
NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 90.06.44.80.98
The query fields are described as follows:

timestamp: The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec".
gpu_name: The official product name of the GPU, as an alphanumeric string. Available on all products.
gpu_bus_id: The PCI bus ID as "domain:bus:device.function", in hex.
vbios_version: The VBIOS version of the GPU board.

2. Query GPU metrics:

This query is useful for monitoring hypervisor-side GPU metrics, and it works on both ESXi and XenServer.

$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2024/01/31 07:52:12.927, NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 525.78.01, P0, 3, 3, 35, 0 %, 0 %, 8192 MiB, 7974 MiB, 0 MiB
2024/01/31 07:52:17.929, NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 525.78.01, P0, 3, 3, 36, 0 %, 0 %, 8192 MiB, 7974 MiB, 0 MiB
2024/01/31 07:52:22.930, NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 525.78.01, P0, 3, 3, 37, 0 %, 0 %, 8192 MiB, 7974 MiB, 0 MiB

We can get a complete list of the query arguments by issuing nvidia-smi --help-query-gpu. When adding parameters to a query, ensure that no spaces appear between the query options (see the scripting example after the table below).

The query fields are described as follows:

timestamp: The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec".
name: The official product name of the GPU, as an alphanumeric string. Available on all products.
pci.bus_id: The PCI bus ID as "domain:bus:device.function", in hex.
driver_version: The version of the installed NVIDIA display driver, as an alphanumeric string.
pstate: The current performance state of the GPU, ranging from P0 (maximum performance) to P12 (minimum performance).
pcie.link.gen.max: The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports, this reports the system's PCIe generation.
pcie.link.gen.current: The current PCI-E link generation. This may be reduced when the GPU is not in use.
temperature.gpu: The core GPU temperature, in degrees Celsius.
utilization.gpu: The percentage of time over the past sample period during which one or more kernels were executing on the GPU. The sample period may be between 1 second and 1/6 second, depending on the product.
utilization.memory: The percentage of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second, depending on the product.
memory.total: Total installed GPU memory.
memory.free: Total free memory.
memory.used: Total memory allocated by active contexts.
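When the CSV output is consumed by scripts, the header row and units can be suppressed with the noheader and nounits format options; as a small example, reusing a few of the fields from the table above:

$ nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used --format=csv,noheader,nounits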

Nvidia-smi Usage for Logging

Short-term Logging

Add the option "-f <filename>" to redirect the output to a file.

Prepend "timeout <seconds>" to run the query for a set number of seconds and then stop logging (some timeout implementations take the duration via a -t flag instead).

Ensure that the query granularity is appropriately sized for the intended use (a combined example follows the table below):

Fine-grain GPU behavior: "-l 5" (5-second interval) with a timeout of 600 seconds (10 minutes)
General GPU behavior: "-l 60" (1-minute interval) with a timeout of 3600 seconds (1 hour)
Long-term GPU behavior: "-l 3600" (1-hour interval) with a timeout of 86400 seconds (24 hours)
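Putting these pieces together, a short-term log can be captured with a single command. The sketch below assumes GNU coreutils timeout, where the duration is passed directly rather than with -t, and gpu_short.log is just a placeholder filename:

$ timeout 600 nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used --format=csv -l 5 -f gpu_short.log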

Long-term Logging

Creating a Monitoring Script

Create a shell script to automate the creation of the log file; timestamp data in the filename and specific query parameters can be added as needed (a variation below shows this).

#!/bin/bash
# Append the full nvidia-smi report to the log file every 10 minutes
while true; do
    /usr/bin/nvidia-smi >> /home/username/gpu_logs.txt
    sleep 600 # 10 minutes
done

Here, the script continuously logs the output of nvidia-smi to gpu_logs.txt every 10 minutes. Let’s save our Bash script as gpu_monitor.sh, and after doing so, we should remember to make it executable with the chmod command, then run the script:

$ chmod +x gpu_monitor.sh
$ ./gpu_monitor.sh

We can also set this script to run at startup or use a tool like screen or tmux to keep it running in the background.
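If separate log files per run are preferred, a small variation adds the start time to the filename and logs only selected query fields. This is a sketch rather than a drop-in replacement: the path and the field list are placeholders to adapt to our setup:

#!/bin/bash
# Sketch: one CSV log per run, named with the start time (path is a placeholder)
LOGFILE="/home/username/gpu_log_$(date +%Y%m%d_%H%M%S).csv"
while true; do
    /usr/bin/nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used \
        --format=csv,noheader >> "$LOGFILE"
    sleep 600 # 10 minutes
done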

Setting up a Cron Job

Alternatively, we can add a custom cron job (stored under /var/spool/cron/crontabs) to run nvidia-smi at the required intervals. We can access the cron schedule for our user by running crontab -e in our terminal:

$ crontab -e

This opens the cron schedule in our default text editor. Then, we can schedule nvidia-smi to run at regular intervals. For example, we can run nvidia-smi every 3 minutes via the cron schedule:

*/3 * * * * /usr/bin/nvidia-smi >> /home/username/gpu_logs.txt

With this in the cron schedule, we append the output of nvidia-smi to a log file gpu_logs.txt in our user home directory every 3 minutes. We should remember to save the cron schedule and exit the editor. The cron job is now set up and will run at our specified intervals.
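The same approach works with the query interface if we only want selected metrics in the log; for example, a hypothetical crontab entry that appends a CSV line of readings every 3 minutes (the output path is a placeholder):

*/3 * * * * /usr/bin/nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used --format=csv,noheader >> /home/username/gpu_metrics.csv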

Additional Commands for Clocks and Power

Enable Persistence Mode

Any settings below for clocks and power get reset between program runs unless you enable persistence mode (PM) for the driver.

Also note that the nvidia-smi command runs much faster when persistence mode is enabled.

Running nvidia-smi -pm 1 makes clock, power, and other settings persist across program runs and driver invocations:

$ nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:03:00.0.
All done.
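We can confirm the current setting at any time, since persistence_mode is one of the standard --query-gpu fields:

$ nvidia-smi --query-gpu=persistence_mode --format=csv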

GPU Clocks

nvidia-smi -q -d SUPPORTED_CLOCKS: View the clocks supported by the GPU
nvidia-smi -ac <memory_clock>,<graphics_clock>: Set one of the supported clock pairs
nvidia-smi -q -d CLOCK: View the current clocks
nvidia-smi --auto-boost-default=ENABLED -i 0: Enable boosting GPU clocks (K80 and later)
nvidia-smi -rac: Reset clocks back to base
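A typical sequence is to list the supported memory/graphics clock pairs, apply one of them, and then verify the result; the clock values below are placeholders that must be taken from the SUPPORTED_CLOCKS output of the specific GPU:

$ nvidia-smi -q -d SUPPORTED_CLOCKS
$ nvidia-smi -ac <memory_clock>,<graphics_clock>
$ nvidia-smi -q -d CLOCK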

GPU Power

nvidia-smi -pl N: Set the power cap (the maximum wattage the GPU will use)
nvidia-smi -pm 1: Enable persistence mode
nvidia-smi stats -i <device#> -d pwrDraw: Continuously monitor detailed stats such as power draw
nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv -l 1: Continuously provide time-stamped power and clock readings

Adjusting Power Limits

Adjusting the power limit can help in balancing performance, energy consumption, and heat generation. First, we can view the current power limit:

$ nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp                                 : Wed Jan 31 08:58:41 2024
Driver Version                            : 525.78.01
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:03:00.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 10.59 W
        Power Limit                       : 175.00 W
        Default Power Limit               : 175.00 W
        Enforced Power Limit              : 175.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 175.00 W
    Power Samples
        Duration                          : 0.14 sec
        Number of Samples                 : 8
        Max                               : 28.37 W
        Min                               : 10.30 W
        Avg                               : 13.28 W

This command shows the current power usage and the power management limits. Let’s now change the power limit:

$ nvidia-smi -pl 150
Power limit for GPU 00000000:03:00.0 was set to 150.00 W from 175.00 W.
All done.

We can replace 150 with our desired power limit in watts; values are only accepted between the minimum and maximum limits reported above, and these vary between GPU models. In addition, while adjusting GPU settings, especially power limits and clocks, we must be cautious: pushing the GPU beyond its limits can lead to instability or damage.
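Finally, after changing the limit, we can confirm the value actually enforced with a quick query, since power.limit and power.draw are standard --query-gpu fields:

$ nvidia-smi --query-gpu=power.limit,power.draw --format=csv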