Choose Your QwQ-32B Hosting Plans
GPU Mart offers the best budget GPU servers for QwQ-32B. These cost-effective dedicated GPU servers are ideal for hosting your own LLMs online.
Half-Year Pay, Full-Year Deal
Advanced GPU Dedicated Server - A5000
$ 191.90/mo
45% OFF Recurring (Was $349.00)
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A5000
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Enterprise GPU Dedicated Server - RTX A6000
$ 409.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Enterprise GPU Dedicated Server - RTX 4090
$ 409.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 4090
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
Enterprise GPU Dedicated Server - A100
$ 639.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation and large-scale inference, AI training, ML, etc.
Multi-GPU Dedicated Server - 2xA100
$ 1099.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- Free NVLink Included
- A powerful dual-GPU solution for demanding AI workloads, large-scale inference, ML training, etc. A cost-effective alternative to the A100 80GB and H100, delivering exceptional performance at a competitive price.
New Arrival
Enterprise GPU Dedicated Server - A100(80GB)
$ 1559.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 80GB HBM2e
- FP32 Performance: 19.5 TFLOPS
Enterprise GPU Dedicated Server - H100
$ 2099.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia H100
- Microarchitecture: Hopper
- CUDA Cores: 14,592
- Tensor Cores: 456
- GPU Memory: 80GB HBM2e
- FP32 Performance: 183 TFLOPS
More GPU Server Instance Pricing
QwQ 32B Benchmark Performance
QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
How to Run QwQ 32B with Ollama or vLLM
vLLM is an optimized inference engine that delivers high-speed token generation and efficient memory management, making it ideal for large-scale AI applications. Ollama is a lightweight, user-friendly framework that simplifies running open-source LLMs on local machines. Choose the one that best fits your needs.
Install Ollama or vLLM
Run QwQ-32B with Ollama or vLLM
Chat with QwQ-32B
Sample 1 - Run QwQ-32B with the Ollama Command Line
# install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh

# on GPU dedicated server with RTX 4090 24GB
ollama run qwq
>>> How many r's are in the word "strawberry"?
……
Final Answer: There are three "R"s in "strawberry."

total duration:       55.10358031s
load duration:        54.175044ms
prompt eval count:    235 token(s)
prompt eval duration: 45ms
prompt eval rate:     5222.22 tokens/s
eval count:           1936 token(s)
eval duration:        54.949s
eval rate:            35.23 tokens/s
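To chat with QwQ-32B from code instead of the interactive prompt, you can also call Ollama's local REST API. The snippet below is a minimal sketch, assuming Ollama is running on its default port 11434 and the model was pulled as qwq; adjust the host, port, and model tag to match your setup.

# minimal sketch: chat with the locally served QwQ model via Ollama's REST API
# assumes Ollama listens on the default port 11434 and the model tag is "qwq"
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwq",
        "messages": [
            {"role": "user", "content": "How many r's are in the word \"strawberry\"?"}
        ],
        "stream": False,  # return one JSON response instead of a token stream
    },
    timeout=600,
)
print(resp.json()["message"]["content"])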
Sample 2 - Run QwQ-32B with vLLM
By default, vLLM downloads the model weights from Hugging Face in BF16 precision, which is about 72GB, roughly 4 times the size of the 4-bit quantized version in the Ollama library. A GPU card with 80GB of memory is therefore recommended.
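A quick back-of-the-envelope estimate shows why. This is a sketch only, assuming the roughly 32.5 billion parameters listed on the QwQ-32B model card; real checkpoints and quantized builds carry extra overhead on top of these figures.

# rough weight-size estimate for QwQ-32B (illustrative only; real files are somewhat larger)
params = 32.5e9               # approximate parameter count from the model card
bf16_gb = params * 2 / 1e9    # BF16 stores 2 bytes per parameter -> ~65 GB before overhead
int4_gb = params * 0.5 / 1e9  # 4-bit quantization stores ~0.5 bytes per parameter -> ~16 GB
print(f"BF16: ~{bf16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")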
# Prerequisites
# A100 80GB or H100 GPU Dedicated Server
uv pip install vllm
vllm serve Qwen/QwQ-32B --max-model-len 4096
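Once the server is running, vLLM exposes an OpenAI-compatible API, by default at http://localhost:8000/v1. The snippet below is a minimal client sketch using the openai Python package; the base URL and the placeholder api_key reflect local defaults, so adjust them to match your deployment.

# minimal sketch: query the endpoint started by "vllm serve Qwen/QwQ-32B"
# assumes the default address http://localhost:8000/v1; any non-empty api_key works locally
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "user", "content": "How many r's are in the word \"strawberry\"?"}
    ],
)
print(completion.choices[0].message.content)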
Sample 3 - Run QwQ-32B with Hugging Face Transformers
Below is a brief example demonstrating how to use QwQ-32B via Hugging Face Transformers.
# Prerequisites
# A100 80GB or H100 GPU Dedicated Server
# uv pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
FAQs of QwQ-32B Hosting
Here are some Frequently Asked Questions about QwQ-32B.
What is QwQ-32B?
QwQ is the reasoning model series of the Qwen family. Compared with conventional instruction-tuned models, QwQ is capable of thinking and reasoning, which yields significantly better performance on downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model in the series, achieving performance competitive with state-of-the-art reasoning models such as DeepSeek-R1 and o1-mini.
Who can use QwQ-32B?
QwQ-32B is released as an open-weight model on Hugging Face and ModelScope under the Apache 2.0 license, and it is also accessible via Qwen Chat. You can deploy it locally or use the official online service.
How can I deploy QwQ-32B?
QwQ-32B can be deployed via Ollama, vLLM, or on-premise solutions.