Failed to initialize NVML: Driver/library version mismatch - Troubleshooting

Introduction

When working with GPU servers, you may have run into this problem: nvidia-smi worked fine yesterday, but today it suddenly fails with an error.

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

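This problem occurs because the version of the NVIDIA user-space libraries installed on the system no longer matches the version of the NVIDIA kernel module that is currently loaded. It typically appears after a package upgrade replaces the driver files while the old module is still loaded. To fix it, remove the current NVIDIA driver and reinstall the correct version by following the steps below.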

How to fix the “failed to initialize nvml: driver/library version mismatch” error

Step 1: Check the Kernel Module Version

Check the version of the NVIDIA kernel module that is currently loaded.

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.129.03  Thu Oct 19 18:56:32 UTC 2023
GCC version:  gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

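The NVRM line shows that the loaded kernel module is version 535.129.03. The GCC line shows the compiler that built it, here GCC 11.4.0 as packaged for Ubuntu 22.04. It is not the kernel version; run uname -r if you need that.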
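To see the mismatch directly, it can help to compare the version of the module that is loaded with the version of the module file installed on disk. The following is a minimal sketch and not part of the original procedure: the helper name nvrm_version is made up for illustration, and it assumes the module is named nvidia so that modinfo can find it.

```shell
# Hypothetical helper: print the driver version from the NVRM line of
# /proc/driver/nvidia/version (or of a file passed as the first argument).
nvrm_version() {
  awk '/^NVRM version:/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
       }' "${1:-/proc/driver/nvidia/version}"
}

# Version of the module that is currently loaded in the kernel.
loaded=$(nvrm_version 2>/dev/null || true)
# Version of the module file on disk (assumes the module is named "nvidia").
on_disk=$(modinfo -F version nvidia 2>/dev/null || true)

echo "loaded module:  ${loaded:-none}"
echo "module on disk: ${on_disk:-none}"
```

If the two values differ, that suggests the driver files were updated after the module was loaded, which is the situation this article addresses.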

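Step 2: Remove the NVIDIA Driver

Purge all NVIDIA packages, including nvidia-common, by running the following commands. The patterns are quoted so that the shell passes them to apt unchanged instead of expanding them against files in the current directory.

$ sudo apt purge 'nvidia-*'
$ sudo apt purge 'libnvidia-*'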

If all of the packages were removed, the following command prints nothing.

$ dpkg -l | grep -i nvidia
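If the command above does print something, the package state in the first column tells you what is left. The sketch below filters on the package name only, so a package that merely mentions NVIDIA in its description is not listed. The helper name leftover_nvidia is hypothetical.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print the status
# and name of every package whose name mentions nvidia.
leftover_nvidia() {
  awk '$1 ~ /^[a-zA-Z][a-zA-Z][a-zA-Z]?$/ && tolower($2) ~ /nvidia/ { print $1, $2 }'
}

# Only query dpkg where it exists (Debian and Ubuntu systems).
if command -v dpkg >/dev/null 2>&1; then
  dpkg -l | leftover_nvidia
fi
```

A status of ii means the package is still installed. A status of rc means the package was removed but its configuration files remain, which a purge should clear.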

Step 3: Find available driver versions

Ubuntu's own driver management tool, ubuntu-drivers, can list the drivers available for the detected hardware and mark the one it recommends for the current Ubuntu release.

$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:08.0/0000:03:00.0 ==
modalias : pci:v000010DEd000013BBsv000010DEsd00001098bc03sc00i00
vendor   : NVIDIA Corporation
model    : GM107GL [Quadro K620]
driver   : nvidia-driver-535 - distro non-free recommended
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

In the output above, the line marked "recommended" identifies the driver that Ubuntu suggests for this card, here nvidia-driver-535.
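When the list is long, the recommended package can be picked out with a small filter. This is a sketch that assumes the output keeps the layout shown above, with "recommended" as the last word on the line. The helper name recommended_driver is hypothetical.

```shell
# Hypothetical helper: read `ubuntu-drivers devices` output on stdin and
# print the package name from every line marked "recommended".
recommended_driver() {
  awk '$1 == "driver" && $NF == "recommended" { print $3 }'
}

# Only run the query where the tool is installed.
if command -v ubuntu-drivers >/dev/null 2>&1; then
  ubuntu-drivers devices | recommended_driver
fi
```

On the machine used in this article, this would be expected to print nvidia-driver-535.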

Step 4: Reinstall the Correct Driver

To install the recommended driver, run the following command.

$ sudo apt install nvidia-driver-535

Alternatively, run the following command to have Ubuntu install the recommended driver automatically.

$ sudo ubuntu-drivers autoinstall

Step 5: Restart the System

Reboot the machine for the changes to take effect.

$ sudo reboot

Step 6: Verify the Issue is Fixed

The reboot in the previous step is required, so do not skip it. Once the machine is back up, run the following command to test the driver installation. As shown in the figure below, the recommended version of the driver was installed successfully.

$ nvidia-smi
[Figure: nvidia-smi output on the Quadro K620]
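Beyond reading the table by eye, the check can be scripted by comparing the version nvidia-smi reports with the version of the loaded kernel module. This is a minimal sketch under the assumption that /proc/driver/nvidia/version has the format shown in Step 1. The helper name same_version is hypothetical.

```shell
# Hypothetical helper: succeed only if both versions are non-empty and equal.
same_version() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Driver version as reported by nvidia-smi (first GPU only).
  smi=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  # Version of the loaded kernel module, taken from the NVRM line.
  mod=$(awk '/^NVRM version:/ {
               for (i = 1; i <= NF; i++)
                 if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
             }' /proc/driver/nvidia/version 2>/dev/null || true)
  if same_version "$smi" "$mod"; then
    echo "OK: nvidia-smi and the kernel module both report $smi"
  else
    echo "Mismatch: nvidia-smi reports '$smi', kernel module reports '$mod'"
  fi
fi
```

If the two versions agree, the mismatch from the start of this article should be gone.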