How to Stop NVIDIA Drivers from Breaking: A Production-Ready Guide for GPU Servers

A practical, step-by-step guide for AI developers and DevOps to prevent NVIDIA drivers from breaking after kernel updates on Ubuntu 20.04, 22.04, and 24.04. Learn to use `ubuntu-drivers autoinstall`, lock driver and kernel versions, use DKMS correctly, and build a stable GPU environment.

Running AI, machine learning, or rendering workloads on GPU VPS or GPU Bare Metal servers requires more than just raw power—it demands stability. One of the most common and costly operational headaches is the NVIDIA driver breaking after a system update, especially following a Linux kernel upgrade.

This guide moves beyond theory to provide concrete, actionable steps. We'll explain why NVIDIA drivers fail and walk you through building a rock-solid, production-ready GPU environment that withstands the chaos of system updates on Ubuntu 20.04, 22.04, and 24.04 LTS.


Why Is nvidia-smi Suddenly Failing? The Root Cause

If you've ever rebooted a GPU server after apt upgrade and found that nvidia-smi fails with an error like "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver", you've hit this problem. The failure isn't random; it's a version mismatch between the running kernel and the driver's kernel modules.

NVIDIA drivers are not just user-space programs; they include kernel modules (nvidia.ko, nvidia_uvm.ko, etc.) that are deeply integrated with the Linux kernel. These modules are compiled specifically for the exact kernel version you are running.

When an automatic update (like Ubuntu's unattended-upgrades) installs a new kernel, the next reboot brings the system up on it. The old NVIDIA kernel modules no longer match. DKMS (Dynamic Kernel Module Support) is supposed to rebuild the modules for the new kernel automatically, but that rebuild can fail silently, and often does: missing kernel headers, a compiler mismatch, or a kernel newer than the driver supports. The result? The driver is broken, and your GPUs "disappear" from the system.
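If you suspect a server is already in this state, a few read-only checks narrow it down quickly. This is a sketch that assumes the driver was installed through the nvidia-dkms packaging (the rebuild command is safe to re-run):

```shell
# Is the NVIDIA kernel module loaded for the running kernel?
lsmod | grep -i '^nvidia' || echo "NVIDIA modules are not loaded"

# Which kernels has DKMS built (or failed to build) the module for?
dkms status

# If the module is missing for the current kernel, try rebuilding it;
# on failure, inspect the DKMS build log for the real error
sudo dkms autoinstall -k "$(uname -r)" \
    || echo "Rebuild failed; check /var/lib/dkms/nvidia/*/build/make.log" >&2
```

If `dkms status` lists the module for an older kernel but not the one `uname -r` reports, you have found your culprit.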


Building a Stable GPU Server: A Step-by-Step Guide

To prevent this, you must treat the Linux kernel and the NVIDIA driver as a single, immutable pair. Here’s how to enforce that stability.

Step 1: Start with a Long-Term Support (LTS) Foundation

Production systems need predictability. Bleeding-edge kernels introduce risk for no practical benefit in most GPU workloads.

  • OS: Always use an LTS release, such as Ubuntu 20.04, 22.04, or 24.04 LTS.
  • Kernel: Stick to the LTS kernels provided by the OS. Avoid mainline or experimental builds.

You can check your current kernel version with:

uname -r
# Example output on Ubuntu 22.04: 5.15.0-105-generic
# Example output on Ubuntu 24.04: 6.8.0-31-generic

This version string is your anchor.

Step 2: Install and Pin the NVIDIA Driver

Auto-updating NVIDIA drivers is a recipe for disaster on a production server. A new driver might have a bug, perform worse, or require a newer kernel you don't have.

First, add the official graphics-drivers PPA so you have access to current stable NVIDIA releases (optional on recent LTS releases, since Ubuntu's own archive also ships tested drivers). Then run ubuntu-drivers autoinstall to let Ubuntu select and install the driver it recommends for your hardware. This is the best starting point for a new installation.

# 1. Add the official NVIDIA graphics-drivers PPA
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update

# 2. Install the recommended NVIDIA driver for your system
#    'autoinstall' selects the stable, tested driver for your GPU/kernel.
sudo ubuntu-drivers autoinstall

# 3. Install the NVIDIA settings panel (optional, but useful)
sudo apt install -y nvidia-settings

After the autoinstall completes, you need to identify which specific driver version was installed so you can pin it. You can find this by running nvidia-smi (look for "Driver Version") or dpkg -l | grep nvidia-driver. For example, if nvidia-smi shows driver version 550.78, the corresponding package name will likely be nvidia-driver-550.
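Both lookups can be done in one short snippet. A sketch (package names can differ if you installed a -server variant of the driver):

```shell
# Driver version as reported by the running driver (first GPU)
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1

# Installed driver packages and their versions, as apt knows them
dpkg -l 'nvidia-driver-*' 2>/dev/null | awk '/^ii/ {print $2, $3}'
```

The second command prints the exact package name you need for the hold in the next step.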

Once you know the exact package name (e.g., nvidia-driver-550), immediately "pin" or "hold" it to prevent automatic upgrades:

# Example: Lock the nvidia-driver-550 package
sudo apt-mark hold nvidia-driver-550

This tells the package manager, "Do not upgrade this driver package." Holding the driver with apt-mark hold is the key to a stable NVIDIA driver installation on Ubuntu, and it is just as critical on 24.04 as on earlier LTS releases.
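You can confirm the hold took effect at any time; it is worth doing after any package maintenance. A minimal check (the nvidia-driver-550 name follows the example above):

```shell
# List held packages and verify an NVIDIA driver package is among them
if apt-mark showhold | grep -q '^nvidia-driver-'; then
    echo "NVIDIA driver package is pinned"
else
    echo "WARNING: no NVIDIA driver package is held" >&2
fi
```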

Step 3: Pin the Linux Kernel (The Critical Step)

Since the driver is tied to the kernel, you must also lock the kernel itself. This prevents apt from installing a new kernel version that would break your pinned driver.

First, identify the exact package names for your currently running kernel and its headers:

# Find the installed kernel and headers packages
dpkg -l | grep -E 'linux-image|linux-headers' | grep "$(uname -r)"

You'll see output like linux-image-6.8.0-31-generic and linux-headers-6.8.0-31-generic. Hold them, and also hold the kernel meta-packages; if the meta-packages stay unpinned, apt will still pull in a brand-new versioned kernel through them:

# Use the exact package names from the previous command
# This locks both the kernel and the headers needed by DKMS
sudo apt-mark hold linux-image-6.8.0-31-generic linux-headers-6.8.0-31-generic

# Also hold the meta-packages that would otherwise install newer kernels
sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic

Your kernel and driver are now locked into a stable, compatible pair.
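Before trusting the locks, simulate an upgrade: anything pinned should show up as "kept back" rather than being installed. A quick sanity check:

```shell
# Dry-run: -s simulates the upgrade, nothing is actually installed
apt-get -s upgrade | grep -A5 'kept back' \
    || echo "Nothing is being held back; double-check your apt-mark holds"
```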

Step 4: Verify the Installation and DKMS Status

After any installation, verify that everything is working correctly.

  1. Check DKMS Status: Ensure the NVIDIA module was built and loaded correctly.

    dkms status
    # Expected output on Ubuntu 24.04:
    # nvidia/550.78, 6.8.0-31-generic, x86_64: installed

    This shows the NVIDIA driver is successfully installed for your kernel.

  2. Check nvidia-smi: This is the ultimate test.

    nvidia-smi

    If this command displays your GPU(s) and driver information, your setup is stable and correct.
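These two checks are also worth automating. The sketch below wraps them in a function you could call from cron or a systemd timer; the check_gpu name and the alerting wiring are placeholders, not part of any standard tooling:

```shell
# Returns non-zero and logs to stderr when the driver is unhealthy
check_gpu() {
    if ! nvidia-smi --query-gpu=name,driver_version --format=csv,noheader; then
        echo "GPU driver check FAILED on $(hostname), kernel $(uname -r)" >&2
        return 1
    fi
}
```

Schedule it and alert on a non-zero exit, for example `check_gpu || logger -t gpu-check 'driver unhealthy'`.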

Step 5: Disable Secure Boot

Secure Boot is a UEFI firmware feature that requires all kernel modules to be cryptographically signed. While well-intentioned, it often blocks the third-party NVIDIA kernel modules built by DKMS. On a dedicated GPU server, it provides minimal security benefit while creating a significant operational risk.

Recommendation: Disable Secure Boot in the server's BIOS/UEFI settings. This simple change prevents a whole class of silent driver loading failures. (If policy requires Secure Boot to stay on, the alternative is to enroll a Machine Owner Key with mokutil and sign the DKMS-built modules, at the cost of extra operational complexity.)
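Before touching firmware settings, check the current state from the running system. This only reads state and changes nothing (mokutil ships in Ubuntu's archive):

```shell
# Reports "SecureBoot enabled" or "SecureBoot disabled" on EFI systems
mokutil --sb-state 2>/dev/null || echo "mokutil not installed, or not an EFI system"
```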


Advanced Management: Containerization and Golden Images

For larger-scale deployments like multi-tenant GPU VPS environments or ML platforms, proper GPU server management is essential.

  • Best Practice #1: Use the NVIDIA Container Toolkit: Don't install CUDA and ML frameworks directly on the host. Isolate workloads in containers. This allows you to run different CUDA versions for different projects without ever touching the stable host driver. It's the standard for modern AI/ML infrastructure.

  • Best Practice #2: Create a "Golden Image": Don't configure each server manually. Build one perfect, tested server with the pinned OS, kernel, and driver. Then, clone its disk image to deploy new servers. This "golden image" approach, a cornerstone of professional AI GPU hosting, ensures every machine in your fleet is identical and stable, dramatically reducing support tickets and downtime.
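As a sketch of Best Practice #1, this is roughly what the container route looks like. It assumes Docker is already installed and that NVIDIA's apt repository for the Container Toolkit has been added (the toolkit is not in Ubuntu's default archive):

```shell
# Install the toolkit and register it as a Docker runtime
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run a CUDA 12.4 workload without installing CUDA on the host;
# only the pinned host driver is shared with the container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi \
    || echo "Container could not start; verify Docker and the toolkit install" >&2
```

Each project can then pin its own CUDA image tag while the host driver stays untouched.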


| Component     | Ubuntu 22.04 LTS         | Ubuntu 24.04 LTS         | How to Enforce                  |
|---------------|--------------------------|--------------------------|---------------------------------|
| Kernel        | 5.15 LTS                 | 6.8 LTS                  | `apt-mark hold linux-image-...` |
| NVIDIA Driver | 535 (LTS branch)         | 550 (LTS branch)         | `apt-mark hold nvidia-driver-...` |
| Runtime       | NVIDIA Container Toolkit | NVIDIA Container Toolkit | Isolate workloads in Docker     |

Command Cheat Sheet & Common Mistakes

Commands to Remember:

  • Add NVIDIA PPA: sudo add-apt-repository ppa:graphics-drivers/ppa -y
  • Auto-install recommended driver: sudo ubuntu-drivers autoinstall
  • Install NVIDIA Settings: sudo apt install -y nvidia-settings
  • Check kernel: uname -r
  • Lock a package: sudo apt-mark hold <package-name>
  • Unlock a package: sudo apt-mark unhold <package-name>
  • Check DKMS: dkms status
  • Verify driver: nvidia-smi

Common Mistakes to Avoid:
❌ Running apt upgrade without pinned packages.
❌ Using non-LTS kernels or beta NVIDIA drivers in production.
❌ Forgetting to hold the linux-headers package along with the kernel image.
❌ Ignoring dkms status and assuming the driver works post-install.
❌ Leaving Secure Boot enabled on a dedicated GPU server.
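The mistakes above can be caught by one audit you run after any maintenance window. This is a sketch built from the earlier steps; the audit_gpu_host name is a placeholder:

```shell
# Audit a GPU host: holds in place, DKMS module built, driver answering
audit_gpu_host() {
    status=0
    apt-mark showhold | grep -q '^nvidia-driver-'   || { echo "FAIL: driver package not held"; status=1; }
    apt-mark showhold | grep -q '^linux-image-'     || { echo "FAIL: kernel image not held"; status=1; }
    dkms status 2>/dev/null | grep -q "$(uname -r)" || { echo "FAIL: no DKMS module for this kernel"; status=1; }
    nvidia-smi > /dev/null 2>&1                     || { echo "FAIL: nvidia-smi not responding"; status=1; }
    [ "$status" -eq 0 ] && echo "OK: GPU host looks healthy"
    return "$status"
}
```

A non-zero return from the function is your cue to investigate before the next reboot surprises you.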


Final Thoughts: Stability Is a Feature

For GPU VPS and GPU Bare Metal servers, stability is more valuable than having the "latest" version of everything. A well-managed GPU environment is predictable, reliable, and requires minimal intervention. The fastest GPU in the world is useless if its driver disappears when you need it most. By treating the kernel and driver as a single unit, you build a foundation for serious, production-grade work.