Choose Your QwQ-32B Hosting Plans
GPU Mart offers the best budget GPU servers for QwQ-32B. These cost-effective dedicated GPU servers are ideal for hosting your own LLMs online.
Half-Year Pay, Full-Year Deal
Advanced GPU Dedicated Server - A5000
$ 191.90/mo
45% OFF Recurring (Was $349.00)
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A5000
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Enterprise GPU Dedicated Server - RTX A6000
$ 409.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Enterprise GPU Dedicated Server - RTX 4090
$ 409.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 4090
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
Enterprise GPU Dedicated Server - A100
$ 639.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation and large-scale inference, AI training, ML, etc.
Multi-GPU Dedicated Server - 2xA100
$ 1099.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- Free NVLink Included
- A powerful dual-GPU solution for demanding AI workloads, large-scale inference, ML training, etc. A cost-effective alternative to the A100 80GB and H100, delivering exceptional performance at a competitive price.
New Arrival
Enterprise GPU Dedicated Server - A100(80GB)
$ 1559.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 80GB HBM2e
- FP32 Performance: 19.5 TFLOPS
Enterprise GPU Dedicated Server - H100
$ 2099.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia H100
- Microarchitecture: Hopper
- CUDA Cores: 14,592
- Tensor Cores: 456
- GPU Memory: 80GB HBM2e
- FP32 Performance: 183 TFLOPS
More GPU Server Instance Pricing
QwQ 32B Benchmark Performance
QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
How to Run QwQ 32B with Ollama or vLLM
vLLM is an optimized inference engine that delivers high-speed token generation and efficient memory management, making it ideal for large-scale AI applications. Ollama is a lightweight, user-friendly framework that simplifies running open-source LLMs on local machines. Choose the one that best fits your needs.
Install Ollama or vLLM
Run QwQ-32B with Ollama or vLLM
Chat with QwQ-32B
Sample 1 - Run QwQ-32B with the Ollama Command Line
# install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh

# on GPU dedicated server with RTX 4090 24GB
ollama run qwq
>>> How many r's are in the word "strawberry"?
……
Final Answer: There are three "R"s in "strawberry."

total duration:       55.10358031s
load duration:        54.175044ms
prompt eval count:    235 token(s)
prompt eval duration: 45ms
prompt eval rate:     5222.22 tokens/s
eval count:           1936 token(s)
eval duration:        54.949s
eval rate:            35.23 tokens/s
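To chat with QwQ-32B from code instead of the interactive prompt, you can also call Ollama's local REST API. The snippet below is a minimal sketch, assuming Ollama is running on its default port 11434 and the model was pulled as qwq; adjust the host, port, and model tag to match your setup.

# minimal sketch: chat with the locally served QwQ model via Ollama's REST API
# assumes Ollama listens on the default port 11434 and the model tag is "qwq"
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwq",
        "messages": [
            {"role": "user", "content": "How many r's are in the word \"strawberry\"?"}
        ],
        "stream": False,  # return one JSON response instead of a token stream
    },
    timeout=600,
)
print(resp.json()["message"]["content"])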
Sample 2 - Run QwQ-32B with vLLM
By default, vLLM downloads the model weights from Hugging Face in BF16 precision, which is about 72GB, roughly 4 times the size of the 4-bit quantized version in the Ollama library. A GPU card with 80GB of memory is therefore recommended.
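A quick back-of-the-envelope estimate shows why. This is a sketch only, assuming the roughly 32.5 billion parameters listed on the QwQ-32B model card; real checkpoints and quantized builds carry extra overhead on top of these figures.

# rough weight-size estimate for QwQ-32B (illustrative only; real files are somewhat larger)
params = 32.5e9               # approximate parameter count from the model card
bf16_gb = params * 2 / 1e9    # BF16 stores 2 bytes per parameter -> ~65 GB before overhead
int4_gb = params * 0.5 / 1e9  # 4-bit quantization stores ~0.5 bytes per parameter -> ~16 GB
print(f"BF16: ~{bf16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")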
# Prerequisites
# A100 80GB or H100 GPU Dedicated Server
uv pip install vllm
vllm serve Qwen/QwQ-32B --max-model-len 4096
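Once the server is running, vLLM exposes an OpenAI-compatible API, by default at http://localhost:8000/v1. The snippet below is a minimal client sketch using the openai Python package; the base URL and the placeholder api_key reflect local defaults, so adjust them to match your deployment.

# minimal sketch: query the endpoint started by "vllm serve Qwen/QwQ-32B"
# assumes the default address http://localhost:8000/v1; any non-empty api_key works locally
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "user", "content": "How many r's are in the word \"strawberry\"?"}
    ],
)
print(completion.choices[0].message.content)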
Sample 3 - Run QwQ-32B with Hugging Face Transformers
Below is a brief example demonstrating how to use QwQ-32B via Hugging Face Transformers.
# Prerequisites
# A100 80GB or H100 GPU Dedicated Server
# uv pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
FAQs of QwQ-32B Hosting
Here are some Frequently Asked Questions about QwQ-32B.
What is QwQ-32B?
QwQ is the reasoning model series of the Qwen family. Compared with conventional instruction-tuned models, QwQ is capable of thinking and reasoning, which yields significantly better performance on downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model in the series, achieving performance competitive with state-of-the-art reasoning models such as DeepSeek-R1 and o1-mini.
Who can use QwQ-32B?
QwQ-32B is released as an open-weight model on Hugging Face and ModelScope under the Apache 2.0 license, and it is also accessible via Qwen Chat. You can deploy it locally or use the official online service.
How can I deploy QwQ-32B?
QwQ-32B can be deployed via Ollama, vLLM, or on-premise solutions.