When working with GPU servers, you may have run into this problem: nvidia-smi worked fine yesterday, but today it suddenly fails with an error.
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
This error means that the version of the NVIDIA user-space library (NVML) does not match the version of the NVIDIA kernel module that is currently loaded. To fix it, we remove the current NVIDIA driver and reinstall the correct version. Follow the steps below closely.
Check the kernel module version used by the graphics driver
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
GCC version: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
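The NVRM line shows that the loaded kernel module is version 535.129.03. The GCC line names the compiler that built it; its Ubuntu build suffix (~22.04) indicates an Ubuntu 22.04 system. Note that 22.04 is the Ubuntu release, not the kernel version (use uname -r for that).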
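To make the mismatch visible before removing anything, the loaded module version can be compared with the version of the module installed on disk. The following is a minimal sketch, not a definitive diagnostic. The helper names `nvrm_version` and `compare_versions` are hypothetical, and it assumes that the on-disk module reported by `modinfo` was upgraded together with the user-space libraries, which may not hold on every system.

```shell
#!/usr/bin/env bash
# Sketch: compare the loaded NVIDIA kernel module with the module on disk.
# Helper names are illustrative and not part of any NVIDIA tool.

# Print the module version found in /proc/driver/nvidia/version-style
# text read from stdin (assumes the "Kernel Module <version>" layout
# shown in the output above).
nvrm_version() {
    sed -n 's/^NVRM version:.*Kernel Module[[:space:]]*\([0-9][0-9.]*\).*/\1/p'
}

# Report whether two version strings agree.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: loaded=$1 on-disk=$2"
    fi
}

# Only run the live check when the driver's proc file and modinfo exist.
if [ -r /proc/driver/nvidia/version ] && command -v modinfo >/dev/null 2>&1; then
    loaded=$(nvrm_version < /proc/driver/nvidia/version)
    ondisk=$(modinfo -F version nvidia 2>/dev/null || true)
    compare_versions "$loaded" "${ondisk:-unknown}"
fi
```

If the two versions differ, the running module is likely older than the freshly installed files, which is consistent with the error above.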
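Purge all NVIDIA packages, including nvidia-common, with the following commands. The patterns are quoted so that the shell passes them to apt unchanged instead of expanding them against files in the current directory.
$ sudo apt purge 'nvidia-*'
$ sudo apt purge 'libnvidia-*'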
If the relevant packages have been cleared, the output of the following command will be empty.
$ dpkg -l | grep -i nvidia
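The emptiness check above can also be scripted. This is a small sketch under one stated assumption: it matches on the package name column only, so it is slightly stricter than the plain `grep`, which also matches descriptions. The function name `leftover_nvidia_packages` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list packages whose name still contains "nvidia".
# The function name is illustrative only.

# Read `dpkg -l`-style text on stdin and print the package name
# (second column) of every entry whose name contains "nvidia".
leftover_nvidia_packages() {
    awk 'tolower($2) ~ /nvidia/ { print $2 }'
}

# Only run the live check on systems that have dpkg.
if command -v dpkg >/dev/null 2>&1; then
    leftovers=$(dpkg -l | leftover_nvidia_packages)
    if [ -z "$leftovers" ]; then
        echo "no NVIDIA packages remain"
    else
        echo "still installed or not fully purged:"
        echo "$leftovers"
    fi
fi
```

Entries in the `rc` state (removed, configuration files kept) are listed as well, since they indicate a package that was removed but not purged.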
Ubuntu ships its own driver management tool, ubuntu-drivers. Running ubuntu-drivers devices lists the drivers available for the detected hardware and marks the one recommended for the current Ubuntu release.
$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:08.0/0000:03:00.0 ==
modalias : pci:v000010DEd000013BBsv000010DEsd00001098bc03sc00i00
vendor   : NVIDIA Corporation
model    : GM107GL [Quadro K620]
driver   : nvidia-driver-535 - distro non-free recommended
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
In the output above, one line is tagged "recommended". That is the driver Ubuntu suggests for this GPU, here nvidia-driver-535.
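Picking out the recommended line by eye works, but it can also be done with a short filter. This sketch assumes the `driver : <package> - ... recommended` layout shown in the output above; the function name `recommended_driver` is hypothetical and not part of ubuntu-drivers.

```shell
#!/usr/bin/env bash
# Sketch: extract the recommended driver package name.
# The function name is illustrative only.

# Read `ubuntu-drivers devices` output on stdin and print the package
# name from each "driver" line tagged "recommended" (one per device).
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3 }'
}

# Only run the live query when ubuntu-drivers is installed.
if command -v ubuntu-drivers >/dev/null 2>&1; then
    ubuntu-drivers devices 2>/dev/null | recommended_driver
fi
```

On a machine with several GPUs the filter may print more than one line, so check the result before passing it to apt.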
To install the recommended driver, run the following command.
$ sudo apt install nvidia-driver-535
Alternatively, the following command installs the recommended driver automatically, so you do not need to name the version yourself.
$ sudo ubuntu-drivers autoinstall
Reboot the machine for the changes to take effect.
$ sudo reboot
Note that this restart is mandatory. After restarting, enter the following command to test the driver installation. As shown in the figure below, the recommended version of the driver was installed successfully.