How to Run LLMs Locally with Ollama AI

Ollama is a powerful tool designed to help users run a wide range of language models. Get up and running with Llama 2, Mistral, and other large language models locally.

Introduction to Ollama AI

What is Ollama AI?

Ollama AI is an open-source framework that allows you to run large language models (LLMs) locally on your computer. With Ollama, users can easily personalize and create language models according to their preferences. If you're a developer or a researcher, it lets you tap into the power of AI without relying on cloud-based platforms.

Ollama also offers an efficient and convenient way to run multiple types of language models. If you want control and privacy over your AI models, it's a perfect fit. Try Ollama and enjoy the freedom of running language models on your own terms. It is available for download on macOS and Linux. For now, you can install Ollama on Windows via WSL2.

What Does Ollama AI Do?

Ollama allows you to run open-source large language models, such as Llama 2, locally. Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. It optimizes setup and configuration details, including GPU usage.
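For example, you can capture a customized setup in a Modelfile and build a new local model from it. Here is a minimal sketch; the model name my-llama2, the temperature value, and the system prompt are illustrative assumptions:

# Write a simple Modelfile that customizes the llama2 base model
cat > Modelfile <<'EOF'
FROM llama2
# Lower temperature for more focused, less random answers
PARAMETER temperature 0.7
# Give the model a custom system prompt
SYSTEM "You are a concise assistant that answers in plain language."
EOF

# Build the customized model and chat with it
ollama create my-llama2 -f Modelfile
ollama run my-llama2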

Does Ollama Use a GPU?

Ollama is essentially a wrapper around llama.cpp that allows you to run large language models on your own hardware with your choice of model. One of the standout features of Ollama is its ability to leverage GPU acceleration. This is a significant advantage, especially for tasks that require heavy computation. By utilizing the GPU, Ollama can speed up model inference considerably compared to CPU-only setups.
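If you want to confirm that your GPU is actually being used, one simple check (assuming a Linux machine with an NVIDIA card and working drivers) is to watch nvidia-smi while a model answers a prompt:

# Terminal 1: load a model and ask it something
ollama run llama2 "Explain GPU acceleration in one sentence."

# Terminal 2: watch GPU memory usage and utilization climb while it responds
watch -n 1 nvidia-smi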

5 Key Features of Ollama

Ease of Use: Ollama’s simple API makes it straightforward to load, run, and interact with LLMs. You can quickly get started with basic tasks without extensive coding knowledge.

Flexibility: Ollama offers a versatile platform for exploring various applications of LLMs. You can use it for text generation, language translation, creative writing, and more.

Powerful LLMs: Ollama includes pre-trained LLMs like Llama 2, renowned for its size and capabilities. It also lets you build custom models (for example, via a Modelfile) tailored to your specific needs.

Local Execution: Ollama enables you to run LLMs locally on your device, enhancing privacy and control over your data. You don’t rely on cloud-based services and avoid potential latency issues.

Community Support: Ollama actively participates in the LLM community, providing documentation, tutorials, and open-source code to facilitate collaboration and knowledge sharing.

Overall, Ollama.ai stands as a valuable tool for researchers, developers, and anyone interested in exploring the potential of large language models without the complexities of cloud-based platforms. Its ease of use, flexibility, and powerful LLMs make it accessible to a wide range of users.

System Requirements

According to the official Ollama.ai documentation, the recommended system requirements for running Ollama are:

Operating System: Linux (Ubuntu 18.04 or later) or macOS (11 Big Sur or later)

RAM: 8GB for running 3B models, 16GB for running 7B models, 32GB for running 13B models

Disk Space: 12GB for installing Ollama and the base models; additional space is required for storing model data, depending on the models you use.

CPU: Any modern CPU with at least 4 cores is recommended; for running 13B models, a CPU with at least 8 cores is recommended.

GPU (optional): A GPU is not required for running Ollama, but it can improve performance, especially for larger models. If you have a GPU, you can use it to accelerate inference and model customization.

In addition to the above, Ollama also requires a working internet connection to download the base models and install updates.
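On Linux, a few standard commands give you a quick check against these requirements (a rough sketch; nvidia-smi only applies if you have an NVIDIA GPU and drivers installed):

# Total and available RAM
free -h

# Number of CPU cores
nproc

# Free disk space
df -h /

# GPU details, if an NVIDIA GPU is present
nvidia-smi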

How to Install and Use Ollama AI?

Install Ollama AI

Please note: As of February 2024, Ollama.ai only supports macOS and Linux, with Windows support planned for the future. To install Ollama on Linux, simply run one command:

curl https://ollama.ai/install.sh | sh
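Once the script completes, a quick way to confirm Ollama is installed and on your PATH is to print its version:

ollama --version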

Ollama Quickstart

To run and chat with Llama 2 uncensored:

ollama run llama2-uncensored
>>> How to make the world free from war?
There is no simple answer to this question, but there are several steps that can be taken towards creating a more peaceful world. Firstly, governments should invest in diplomacy and conflict resolution instead of relying solely on military force for solving international conflicts. Secondly, education should focus on promoting tolerance, understanding and respect for different cultures and religions to reduce the likelihood of inter-group tensions and violence. Thirdly, there should be a concerted effort to address poverty and economic inequality which can lead to desperation and frustration that fuels conflict. Finally, individuals should strive to live in harmony with others by practicing nonviolence, respecting diversity and working together for the common good.

>>> Send a message (/? for help)

To run and chat with Mistral:

ollama run mistral --verbose

Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

Pull a model

ollama pull llama2
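You can also pull a specific size or variant by appending a tag to the model name; for example (the 13b tag is just one published variant, assuming it is available in the registry):

ollama pull llama2:13b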

Remove a model

ollama rm llama2

List models on your computer

ollama list

Start the Ollama server (when you want to run Ollama without the desktop application)

ollama serve
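With the server running, you can also talk to Ollama through its local REST API, which listens on port 11434 by default. A minimal example with curl, assuming the llama2 model has already been pulled:

# Request a non-streaming completion from the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'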

Ollama help

For more information on how to use Ollama from the command line, run ollama help (or ollama -h):

$ ollama -h
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.

Recommended GPU Server Plans

Here are some cost-effective dedicated GPU servers from GPUMart that are suitable for running Ollama:

Advanced GPU - RTX 3060 Ti

$179.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • Max GPUs: 2
  • CUDA Cores: 4864
  • Tensor Cores: 152
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 16.2 TFLOPS

Advanced GPU - V100

$229.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2690v3
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia V100
  • Microarchitecture: Volta
  • Max GPUs: 1
  • CUDA Cores: 5,120
  • Tensor Cores: 640
  • GPU Memory: 16GB HBM2
  • FP32 Performance: 14 TFLOPS

Advanced GPU - A4000

$167.20/mo (Save 40%, was $279.00)
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A4000
  • Microarchitecture: Ampere
  • Max GPUs: 2
  • CUDA Cores: 6144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS
  • Good choice for AI/Deep Learning, Data Science, CAD/CGI/DCC, etc.

Advanced GPU - A5000

$269.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A5000
  • Microarchitecture: Ampere
  • Max GPUs: 2
  • CUDA Cores: 8192
  • Tensor Cores: 256
  • GPU Memory: 24GB GDDR6
  • FP32 Performance: 27.8 TFLOPS

Enterprise GPU - RTX A6000

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • Max GPUs: 1
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS

Enterprise GPU - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • Max GPUs: 1
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS

Enterprise GPU - A40

$439.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A40
  • Microarchitecture: Ampere
  • Max GPUs: 1
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 37.48 TFLOPS

Enterprise GPU - A100

$639.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • Max GPUs: 1
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2e
  • FP32 Performance: 19.5 TFLOPS