How to install Oobabooga - Text Generation WebUI and Run LLaMA 2 Locally

In this blog post, I will show you how to easily install the TextGen WebUI Oobabooga and run Llama 2 locally or on a dedicated GPU server.

Introduction

LLaMA 2 is a family of generative text models optimized for assistant-like chat use cases that can also be adapted to a variety of natural language generation tasks. The family consists of pre-trained and fine-tuned large language models (LLMs) ranging in scale from 7B to 70B parameters, released by the AI group at Meta, Facebook's parent company.

The Oobabooga Text Generation WebUI is an awesome open-source web interface that lets you run open-source LLMs on your own hardware, completely free. It provides a user-friendly interface for interacting with these models and generating text, with features such as model switching, notebook mode, chat mode, and more. Next, let's see how to install Oobabooga and use it to run Llama 2 locally or on a remote server.

Key Features of Oobabooga

- User-friendly interface: Oobabooga provides a simple and intuitive interface for generating text. Users can simply enter a prompt and select the desired LLM and generation settings.

- Support for multiple LLMs: Oobabooga supports a variety of open-source LLMs, including LLaMA, GPT-J, and BLOOM. This allows users to choose the model best suited to their needs.

- Advanced generation settings: Oobabooga provides a number of advanced generation settings that allow users to control the quality and style of the generated text. These settings include temperature, top-p, and repetition penalty.

- Real-time feedback: Oobabooga provides real-time feedback on the generated text. This allows users to see how the text is changing as they adjust the generation settings.

- Code generation: Oobabooga can be used to generate code in a variety of programming languages. This makes it a valuable tool for developers and programmers.

System Requirements

Even the smallest Llama 2 model, with 7B parameters, demands significant hardware resources for smooth operation. Keep in mind that GPU memory (VRAM) is crucial. You may be able to manage with lower-spec hardware, but performance will be extremely slow.

The system requirements for installing Oobabooga are as follows:

- OS: Windows 10 or later, or Ubuntu 18.04 or later

- RAM: 8GB for 7B models, 16GB for 13B models, 32GB for 30B models, 64GB+ recommended

- CPU: 4+ cores; AVX2 support recommended

- GPU: Optional; for GPU acceleration, 16GB+ VRAM recommended
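On Linux you can quickly check these requirements from the shell (the nvidia-smi query assumes the NVIDIA driver is already installed):

```shell
# CPU: Linux exposes instruction-set flags in /proc/cpuinfo
grep -q avx2 /proc/cpuinfo && echo "AVX2: yes" || echo "AVX2: no"

# RAM: total memory in human-readable form
free -h | awk '/^Mem:/ {print "RAM:", $2}'

# GPU: model name and VRAM (requires the NVIDIA driver)
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```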

8 Steps to Install Oobabooga and Run LLaMA 2

There are different installation methods available, including one-click installers for Windows, Linux, and macOS, as well as manual installation using Conda. Detailed installation instructions can be found in the Text Generation WebUI repository. Below, we demonstrate step by step how to install it on an Ubuntu Linux server with an A5000 GPU.

Prerequisites

Before you begin this guide, you should have a regular, non-root user with sudo privileges and a basic firewall configured on your server. When you have an account available, log in as your non-root user to begin.
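If the server is fresh, a typical Ubuntu setup looks like the sketch below. The username deploy is illustrative, and port 7860 is Gradio's default port for the WebUI:

```shell
# Run as root on a fresh Ubuntu server
adduser deploy            # create a regular user (name is illustrative)
usermod -aG sudo deploy   # grant it sudo privileges
ufw allow OpenSSH         # keep SSH reachable before enabling the firewall
ufw allow 7860/tcp        # default Gradio port used by the WebUI
ufw enable                # turn on the basic firewall
```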

Step 1. Clone or Download Oobabooga Text Generation WebUI

First, let's download the Oobabooga code. There are two ways to do this: clone the repository with git, or download the zip package and unzip it.

# way 1 - clone the Oobabooga git repo
$ git clone https://github.com/oobabooga/text-generation-webui.git

# way 2 - download Oobabooga zip package and unzip it
$ wget https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip
$ unzip main.zip

Step 2. Start Installing the TextGen WebUI Oobabooga

Enter the text-generation-webui directory (or text-generation-webui-main if you unzipped the package), then execute the start_linux.sh script. Note that each operating system has its own launcher: start_linux.sh (Linux), start_windows.bat (Windows), start_macos.sh (macOS), and start_wsl.bat (WSL).

$ cd text-generation-webui/
$ sudo ./start_linux.sh

Step 3. Select your GPU Vendor When Asked

Since we are using NVIDIA's RTX A5000 graphics card, choose option A (NVIDIA) here.

[Screenshot: select your GPU vendor]

Step 4. Select NVIDIA CUDA Version

[Screenshot: select the CUDA version]

Step 5. Wait for the Automatic Installation to Complete

Next, the installer automatically sets up PyTorch and the other dependencies. All we have to do is wait.

[Screenshot: automatic installation in progress]

The installation process takes about ten minutes. The output after the installation is completed is as follows:

[Screenshot: installation finished]

Step 6. Download Llama 2 Models

After the installation is complete, you need to download a Llama 2 model before you can actually use the WebUI. Models should be placed in the folder text-generation-webui/models and are usually downloaded from Hugging Face. For example, use wget to fetch one:

$ cd text-generation-webui/models/
$ wget https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf?download=true -O ./llama-2-7b-chat.Q4_K_M.gguf

Note: GGUF models are a single file and should be placed directly into models.
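Before loading the model, it is worth sanity-checking the download. The Q4_K_M quantization of the 7B chat model is roughly 4 GB, and every valid GGUF file starts with the four magic bytes "GGUF":

```shell
cd text-generation-webui/models/

# The file should be roughly 4 GB
ls -lh llama-2-7b-chat.Q4_K_M.gguf

# A valid GGUF file prints "GGUF" here
head -c 4 llama-2-7b-chat.Q4_K_M.gguf; echo
```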

It is also possible to download a model from the command line with the download-model.py script.

# python download-model.py organization/model
$ python3 download-model.py TheBloke/Llama-2-7B-Chat-GGUF

Note: Run python download-model.py --help to see all the options.

Those are the two ways to download a model from the command line. Alternatively, you can use the "Model" tab of the WebUI to download a model from Hugging Face automatically.

[Screenshot: downloading a model from the Model tab]

When the status area displays "Done", the model download is complete. Click the refresh button on the right side of the Model selection bar, then click the drop-down arrow; the model we just downloaded will appear in the list.

[Screenshot: model download completed]

Step 7. Select a Model and Load It

Select the model llama-2-7b-chat.Q4_K_M.gguf, choose llama.cpp as the Model loader, and click Load. You will see a loading success message:

[Screenshot: loading llama-2-7b-chat.Q4_K_M.gguf]
Successfully loaded llama-2-7b-chat.Q4_K_M.gguf.
It seems to be an instruction-following model with template "Llama-v2". In the chat tab, instruct or chat-instruct modes should be used.
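As an alternative to loading through the UI, the model and loader can be chosen at startup. The --model, --loader, and --n-gpu-layers flags below are standard text-generation-webui launch options; 35 is an illustrative layer count that offloads most of the 7B model to the GPU:

```shell
# Start the WebUI with the GGUF model preloaded via llama.cpp,
# offloading 35 transformer layers to the GPU
./start_linux.sh --model llama-2-7b-chat.Q4_K_M.gguf \
                 --loader llama.cpp \
                 --n-gpu-layers 35
```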

Step 8. Successfully Run Llama 2 Online, Enjoy It

Click the Chat tab in the upper left corner to open the chat page, where you can ask any questions you like.

[Screenshot: chatting with Llama 2]
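Since we installed on a remote server, note that the WebUI binds to localhost:7860 by default, so it is not reachable from outside. Two common ways to access it from your own machine (--listen and --listen-port are standard text-generation-webui options; user@your-server-ip is a placeholder):

```shell
# Option 1: expose the UI on all of the server's network interfaces
./start_linux.sh --listen --listen-port 7860

# Option 2: keep it bound to localhost and tunnel from your workstation,
# then browse to http://localhost:7860
ssh -L 7860:localhost:7860 user@your-server-ip
```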

Conclusion

This article showed how to install the Oobabooga Text Generation WebUI and run Llama 2 locally or on a remote server. Oobabooga provides a chatbot interface where you can enter prompts tailored to your needs, and it is an economical way to run and experiment with open-source chat models.

GPU Mart provides professional GPU hosting services optimized for high-performance computing projects. Here we recommend some bare-metal GPU server plans suitable for running Llama 2 online. Choose the plan that matches the model you want to use: Llama 2 7B runs well on an 8GB graphics card, Llama 2 13B needs a 16GB or 24GB card, and Llama 2 70B needs a 48GB+ card. You can start your journey at any time, and we will be happy to help you with any difficulties.

Advanced GPU - A4000 (209.00/month)
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A4000
  • Microarchitecture: Ampere
  • Max GPUs: 2
  • CUDA Cores: 6144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS

Advanced GPU - A5000 (269.00/month)
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A5000
  • Microarchitecture: Ampere
  • Max GPUs: 2
  • CUDA Cores: 8192
  • Tensor Cores: 256
  • GPU Memory: 24GB GDDR6
  • FP32 Performance: 27.8 TFLOPS

Enterprise GPU - A40 (439.00/month)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: Nvidia A40
  • Microarchitecture: Ampere
  • Max GPUs: 1
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 37.48 TFLOPS

Advanced GPU - V100 (229.00/month)
  • 128GB RAM
  • Dual 12-Core E5-2690v3
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: Nvidia V100
  • Microarchitecture: Volta
  • Max GPUs: 1
  • CUDA Cores: 5,120
  • Tensor Cores: 640
  • GPU Memory: 16GB HBM2
  • FP32 Performance: 14 TFLOPS

Enterprise GPU - RTX A6000 (409.00/month)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • Max GPUs: 1
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS

Multi-GPU - 3xRTX A5000 (539.00/month)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: 3 x Quadro RTX A5000
  • Microarchitecture: Ampere
  • Max GPUs: 3
  • CUDA Cores: 8192 per GPU
  • Tensor Cores: 256 per GPU
  • GPU Memory: 24GB GDDR6 per GPU
  • FP32 Performance: 27.8 TFLOPS per GPU

Multi-GPU - 3xRTX A6000 (899.00/month)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: 3 x Quadro RTX A6000
  • Microarchitecture: Ampere
  • Max GPUs: 3
  • CUDA Cores: 10,752 per GPU
  • Tensor Cores: 336 per GPU
  • GPU Memory: 48GB GDDR6 per GPU
  • FP32 Performance: 38.71 TFLOPS per GPU

Enterprise GPU - A100 (639.00/month)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps bandwidth
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • Max GPUs: 1
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2e
  • FP32 Performance: 19.5 TFLOPS