AI Hosting Sales for Nvidia GPU Server
Professional GPU VPS - RTX Pro 2000
- 30GB RAM
- 16 CPU Cores
- 240GB SSD
- 300Mbps Unmetered Bandwidth
- Backup Once Every 2 Weeks
- OS: Linux / Windows 10 / Windows 11
- Dedicated GPU: Nvidia RTX Pro 2000
- CUDA Cores: 4,352
- Tensor Cores: 5th Gen
- GPU Memory: 16GB GDDR7
- FP32 Performance: 17 TFLOPS
Advanced GPU VPS - RTX Pro 4000
- 60GB RAM
- 24 CPU Cores
- 320GB SSD
- 500Mbps Unmetered Bandwidth
- Backup Once Every 2 Weeks
- OS: Linux / Windows 10 / Windows 11
- Dedicated GPU: Nvidia RTX Pro 4000
- CUDA Cores: 8,960
- Tensor Cores: 280
- GPU Memory: 24GB GDDR7
- FP32 Performance: 34 TFLOPS
Advanced GPU VPS - RTX Pro 5000
- 60GB RAM
- 24 CPU Cores
- 320GB SSD
- 500Mbps Unmetered Bandwidth
- Backup Once Every 2 Weeks
- OS: Linux / Windows 10 / Windows 11
- Dedicated GPU: Nvidia RTX Pro 5000
- CUDA Cores: 14,080
- Tensor Cores: 440
- GPU Memory: 48GB GDDR7
- FP32 Performance: 66.94 TFLOPS
Enterprise GPU VPS - RTX Pro 6000
- 90GB RAM
- 32 CPU Cores
- 400GB SSD
- 1000Mbps Unmetered Bandwidth
- Backup Once Every 2 Weeks
- OS: Linux / Windows 10 / Windows 11
- Dedicated GPU: Nvidia RTX Pro 6000
- CUDA Cores: 24,064
- Tensor Cores: 852
- GPU Memory: 96GB GDDR7
- FP32 Performance: 126 TFLOPS
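
Once a plan is deployed, you can quickly confirm that the dedicated GPU is passed through and visible inside the VPS. Below is a minimal sketch, assuming the Nvidia driver is installed so that the standard nvidia-smi utility is available on the server; the exact name string reported will vary by card.

```python
# Quick check that the dedicated GPU is visible inside the VPS.
# Assumes the Nvidia driver is installed, which provides the nvidia-smi utility.
import subprocess

def gpu_summary() -> str:
    """Return GPU name, total VRAM, and driver version as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    # One line per GPU, e.g. "<gpu name>, 24576 MiB, <driver version>"
    return result.stdout.strip()

if __name__ == "__main__":
    print(gpu_summary())
```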
LLM Frameworks & Tools
LLM Hosting with Ollama — GPU Recommendation
| Model Name | Size (4-bit Quantization) | Recommended GPUs | Tokens/s |
|---|---|---|---|
| deepseek-r1:7b | 4.7GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 26.70-87.10 |
| deepseek-r1:8b | 5.2GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 21.51-87.03 |
| deepseek-r1:14b | 9.0GB | A4000 < A5000 < V100 | 30.2-48.63 |
| deepseek-r1:32b | 20GB | A5000 < RTX4090 < A100-40gb < RTX5090 | 24.21-45.51 |
| deepseek-r1:70b | 43GB | A40 < A6000 < 2×A100-40gb < A100-80gb < H100 < 2×RTX5090 | 13.65-27.03 |
| deepseek-v2:236b | 133GB | 2×A100-80gb < 2×H100 | -- |
| llama3.2:1b | 1.3GB | P1000 < GTX1650 < GTX1660 < RTX2060 < T1000 < RTX3060 Ti < RTX4060 < RTX5060 | 28.09-100.10 |
| llama3.1:8b | 4.9GB | T1000 < RTX3060 Ti < RTX4060 < RTX5060 < A4000 < V100 | 21.51-84.07 |
| llama3:70b | 40GB | A40 < A6000 < 2×A100-40gb < A100-80gb < H100 < 2×RTX5090 | 13.15-26.85 |
| llama3.2-vision:90b | 55GB | 2×A100-40gb < A100-80gb < H100 < 2×RTX5090 | ~12-20 |
| llama3.1:405b | 243GB | 8×A6000 < 4×A100-80gb < 4×H100 | -- |
| gemma2:2b | 1.6GB | P1000 < GTX1650 < GTX1660 < RTX2060 | 19.46-38.42 |
| gemma3:4b | 3.3GB | GTX1650 < GTX1660 < RTX2060 < T1000 < RTX3060 Ti < RTX4060 < RTX5060 | 28.36-80.96 |
| gemma3n:e2b | 5.6GB | T1000 < RTX3060 Ti < RTX4060 < RTX5060 | 30.26-56.36 |
| gemma3n:e4b | 7.5GB | A4000 < A5000 < V100 < RTX4090 | 38.46-70.90 |
| gemma3:12b | 8.1GB | A4000 < A5000 < V100 < RTX4090 | 30.01-67.92 |
| gemma3:27b | 17GB | A5000 < RTX4090 < A100-40gb < H100 = RTX5090 | 28.79-47.33 |
| qwen3:14b | 9.3GB | A4000 < A5000 < V100 | 30.05-49.38 |
| qwen2.5:7b | 4.7GB | T1000 < RTX3060 Ti < RTX4060 < RTX5060 | 21.08-62.32 |
| qwen2.5:72b | 47GB | 2×A100-40gb < A100-80gb < H100 < 2×RTX5090 | 19.88-24.15 |
| qwen3:235b | 142GB | 4×A100-40gb < 2×H100 | ~10-20 |
| mistral:7b / openorca / lite / dolphin | 4.1–4.4GB | T1000 < RTX3060 < RTX4060 < RTX5060 | 23.79-73.17 |
| mistral-nemo:12b | 7.1GB | A4000 < V100 | 38.46-67.51 |
| mistral-small:22b / 24b | 13–14GB | A5000 < RTX4090 < RTX5090 | 37.07-65.07 |
| mistral-large:123b | 73GB | A100-80gb < H100 | ~30 |
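
If you want to reproduce the Tokens/s figures above on your own server, a simple way is to call Ollama's local HTTP API and derive throughput from the timing fields it returns. The snippet below is a minimal sketch, assuming Ollama is running on its default port (11434) and the model has already been pulled (for example with `ollama pull deepseek-r1:14b`); the eval_count and eval_duration fields come from Ollama's /api/generate response.

```python
# Rough tokens/s measurement against a locally hosted Ollama model.
# Assumes Ollama is listening on the default port 11434 and the model is already pulled.
import requests

def measure_tokens_per_second(model: str, prompt: str) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = number of generated tokens; eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    tps = measure_tokens_per_second("deepseek-r1:14b", "Explain KV caching in one paragraph.")
    print(f"{tps:.2f} tokens/s")
```

A single prompt only gives a point estimate; averaging several runs with longer prompts gets closer to the ranges shown in the table.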
LLM Hosting with vLLM + Hugging Face — GPU Recommendation
| Model Name | Size (16-bit Quantization) | Recommended GPU(s) | Concurrent Requests | Tokens/s |
|---|---|---|---|---|
| deepseek-ai/deepseek-coder-6.7b-instruct | ~13.4GB | A5000 < RTX4090 | 50 | 1375–4120 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ~16GB | 2×A4000 < 2×V100 < A5000 < RTX4090 | 50 | 1450–2769 |
| deepseek-ai/deepseek-coder-33b-instruct | ~66GB | A100-80gb < 2×A100-40gb < 2×A6000 < H100 | 50 | 570–1470 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ~135GB | 4×A6000 | 50 | 466 |
| meta-llama/Llama-3.2-3B-Instruct | 6.2GB | A4000 < A5000 < V100 < RTX4090 | 50–300 | 1375–7214.10 |
| meta-llama/Llama-3.3-70B-Instruct / 3.1-70B / Meta-3-70B | 132GB | 4×A100-40gb, 2×A100-80gb, 2×H100 | 50 | ~295.52–990.61 |
| google/gemma-3-4b-it | 8.1GB | A4000 < A5000 < V100 < RTX4090 | 50 | 2014.88–7214.10 |
| google/gemma-2-9b-it | 18GB | A5000 < A6000 < RTX4090 | 50 | 951.23–1663.13 |
| google/gemma-3-12b-it | 23GB | A100-40gb < 2×A100-40gb < H100 | 50 | 477.49–4193.44 |
| google/gemma-3-27b-it | 51GB | 2×A100-40gb < A100-80gb < H100 | 50 | 1231.99–1990.61 |
| Qwen/Qwen2-VL-2B-Instruct | ~5GB | A4000 < V100 | 50 | ~3000 |
| Qwen/Qwen2.5-VL-3B-Instruct | ~7GB | A5000 < RTX4090 | 50 | 2714.88–6980.31 |
| Qwen/Qwen2.5-VL-7B-Instruct | ~15GB | A5000 < RTX4090 | 50 | 1333.92–4009.29 |
| Qwen/Qwen2.5-VL-32B-Instruct | ~65GB | 2×A100-40gb < H100 | 50 | 577.17–1481.62 |
| Qwen/Qwen2.5-VL-72B-Instruct-AWQ | 137GB | 4×A100-40gb < 2×H100 < 4×A6000 | 50 | 154.56–449.51 |
| mistralai/Pixtral-12B-2409 | ~25GB | A100-40gb < A6000 < 2×RTX4090 | 50 | 713.45–861.14 |
| mistralai/Mistral-Small-3.2-24B-Instruct-2506 | ~47GB | 2×A100-40gb < H100 | 50 | ~1200–2000 |
| mistralai/Pixtral-Large-Instruct-2411 | 292GB | 8×A6000 | 50 | ~466.32 |
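
The figures in this table come from serving models in 16-bit precision with batches of concurrent requests. Below is a minimal offline-inference sketch using vLLM's Python API; it assumes the vllm package is installed, the Hugging Face weights are accessible (gated models such as meta-llama require an access token), and tensor_parallel_size is set to the number of GPUs in your plan.

```python
# Minimal vLLM offline-inference sketch: submitting a batch of prompts exercises
# vLLM's continuous batching, similar in spirit to the concurrent-request benchmarks above.
from vllm import LLM, SamplingParams

# tensor_parallel_size should match the GPU count (e.g. 2 for a 2×A100-40gb configuration).
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", tensor_parallel_size=1)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [f"Write a one-sentence summary of use case #{i}." for i in range(50)]

outputs = llm.generate(prompts, sampling)
for out in outputs[:3]:
    print(out.outputs[0].text.strip())
```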
Explanation:
Recommended GPUs: listed from left to right in order of increasing performance.
Tokens/s: throughput range measured in our benchmarks.
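
The model sizes in both tables follow a simple rule of thumb: weight memory is roughly the parameter count times the bytes per parameter at the chosen quantization, plus headroom for the KV cache and runtime buffers. The sketch below is only a rough estimate (the 20% overhead factor is an assumption, and real usage grows with context length and batch size), but it lines up with the 4-bit sizes listed above and with the 14B/32B/70B sizing questions in the FAQ further down.

```python
# Rough rule-of-thumb VRAM estimate for hosting an LLM.
# The 20% overhead factor is an assumption covering KV cache, activations, and runtime
# buffers; actual usage depends on context length, batch size, and the serving framework.
def estimate_vram_gb(params_billion: float, bits_per_param: int = 4, overhead: float = 0.20) -> float:
    weights_gb = params_billion * bits_per_param / 8  # e.g. 14B at 4-bit ≈ 7 GB of weights
    return weights_gb * (1 + overhead)

if __name__ == "__main__":
    for size in (14, 32, 70):
        print(f"{size}B @ 4-bit: ~{estimate_vram_gb(size):.1f} GB VRAM")
```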
Ollama GPU Benchmarks – Model Performance
vLLM GPU Benchmarks – Model Performance
What Clients Say About Our AI Hosting GPU Servers
Questions About AI Hosting Promotion
1. What is an AI hosting server, and how does it work?
2. Which platforms are supported?
3. What GPU memory is required for a 14B model?
4. What GPU memory is required for a 32B model?
5. What GPU memory is required for a 70B model?
6. When should I choose a multi-GPU plan?
7. Can I upgrade my server configuration later?
8. Can I run benchmarks on my own models before committing?
9. Is server maintenance included, or am I responsible for it?
10. Can I customize the server environment to fit my needs?
11. Can I use your servers for both inference and training tasks?
12. How many GPU servers can I buy with the AI hosting promotion?
13. What's the minimum duration for a GPU server order?
14. What does the recurring discount mean?
15. Can I get a discount for my existing GPU server?
16. Will the discount remain if I upgrade/downgrade the plan after the promotion?
17. What payment methods do you accept?
18. How long will it take to set up my server?
19. Can I get a free trial before payment?
Step 1: Submit a Free Trial Request
Select a plan, click 'Order Now,' and leave a note saying 'Need free trial.' Then click 'Check Out' to proceed to the Order Confirm page, where you click 'Confirm' to complete the free trial request.
Step 2: Security Verification
This process takes about 30 minutes to 2 hours. Once verified, you will receive the server login details in the console and can start using the server. If your trial request is not approved, you will be notified via email.
Custom Servers
Server Inquiry