Self-Hosted LLMs: GPU Selection, VRAM Requirements & Hosting Guide

Self-hosted LLMs allow organizations to run AI models on dedicated GPU infrastructure instead of relying on third-party APIs. Compared with token-based AI services, self-hosting offers greater control over text generation performance, data privacy, compliance, and long-term operating costs.

This guide compares GPU requirements, VRAM sizing, benchmark results across 14 GPU configurations, deployment frameworks, and LLM hosting architectures for running 7B–70B language models in production. Whether you are evaluating GPU VPS or dedicated GPU server options, the data here covers LLM GPU requirements, real inference benchmarks, and hosting cost comparisons to support your decision.

Why Teams Switch Away from Cloud AI APIs

Three pain points that surface once your self-hosted LLM text generation workloads move from prototype to production at scale.

Unpredictable Costs

AWS / GCP / OpenAI charge per token — the more you use, the higher the bill. High-traffic peaks trigger rate limits exactly when your users need the service most. There is no ceiling.

Unpredictable Latency

Shared GPU resources cause latency spikes you cannot control or predict. Production SLAs become impossible to guarantee when your AI server competes with thousands of other tenants on shared infrastructure.

Data Privacy Risk

Every prompt sent to a third-party API — customer data, proprietary code, internal knowledge bases — passes through infrastructure you do not control. For regulated industries, this is a hard blocker.

Why Self-Host? The Four Deployment Paths Compared

Understanding structural limits matters more than comparing spec sheets — especially for production LLM inference at scale.

Deployment Path	Examples	Key Advantage	Key Limitation	Best For
Cloud API	OpenAI / Anthropic / Gemini	Fast setup, no ops overhead	Data leaves your env; token costs scale; rate limits at peak	Prototyping, very low volume
Model Marketplace	Together AI / Fireworks	Wide model selection	Shared resources, unstable perf, limited customization	Mid-volume testing
On-Prem Data Center	Private rack	Full control, data sovereignty	$1M+ upfront, complex ops	Hyperscale enterprises only
Dedicated GPU Server (Recommended)	GPU Mart	Stable + fixed cost + private data + model flexibility	Requires basic Linux ops	SMB production, AI startups

GPU Mart Dedicated GPU: Four Core Advantages

Exclusive GPU Resources

No resource preemption, no performance jitter. 100% compute is yours. No noisy-neighbor interference unlike shared cloud pools.

100% Data Private

Data never leaves your LLM server. Meets enterprise compliance and GDPR. No third-party API exposure for sensitive prompts, code, or documents.

30–70% Lower Latency

A dedicated AI server eliminates shared-GPU latency spikes. First-token response is consistently fast — no cold-start delays, no queue contention.

50%+ Lower Long-Term Cost

LLM hosting on flat-rate dedicated hardware replaces per-token billing. LLM VPS hosting starts from $95/mo — predictable budgets, zero surprise invoices.

30–70%Lower latency vs shared cloud

50%+Long-term cost reduction

99.9%Uptime SLA

<5 minSupport response, 24/7

What Hardware Determines LLM Performance?

Understanding LLM GPU requirements — VRAM, memory bandwidth, and precision support — is essential before selecting the best GPU for LLM hosting. These three variables determine what model sizes you can run and how fast tokens generate.

1 — VRAM: LLM GPU Requirements by Model Size

VRAM is the top LLM GPU requirement — the hard ceiling for any self-hosted LLM. If the model doesn't fit in VRAM, no other spec matters. The VRAM requirements below use a 14B parameter model as reference — scale proportionally for larger models.

Note on Q4_K_M (Ollama default): to avoid the significant quality degradation of pure 4-bit quantization, Ollama uses a "mixed precision" approach — core weights at 6-bit, less critical weights at 4-bit. This keeps quality loss minimal while keeping a 35B model at approximately 22–23 GB VRAM.

Quantization Format	Bytes / Param	14B Model VRAM	Notes
FP16 / BF16 (full precision)	2 bytes	~28 GB	Highest quality, no precision loss
FP8	1 byte	~14 GB	Near-FP16 quality; requires native GPU support
INT8	1 byte	~14 GB	Slight quality loss; broad compatibility
Q4_K_M (Ollama GGUF default)	~0.55 bytes	~7–8 GB	Mixed precision (6-bit core + 4-bit other); 35B fits ~22–23 GB
INT4 / AWQ / GPTQ	0.5 bytes	~7 GB	Heavy compression; good for constrained setups

KV Cache VRAM Requirements: Context Length & Concurrency

Beyond model weights, inference requires KV Cache memory for every active request. For Qwen2.5-14B at FP16 (24 layers, 8 KV heads, 64 dims per head):

KV Cache / Token = 2 × L × H_kv × D_head × B = 2 × 24 × 8 × 64 × 2 bytes ≈ 48 KB / Token

Concurrency	Context Length (Tokens)	KV Cache VRAM	Typical Use Case
1	1,024	~48 MB	Single-user short chat
8	1,024	~384 MB	Small team short chat
32	1,024	~1.5 GB	Medium concurrency short chat
1	32,768 (32K)	~1.5 GB	Single-user long document
1	131,072 (128K)	~6 GB	Single-user extended context

VRAM requirements formula: Total VRAM = Model Weights + KV Cache + 20–30% headroom. Always size for peak concurrency, not just model weights. This is the most common cause of OOM errors in self-hosted LLM deployments.

2 — Memory Bandwidth: How Fast Tokens Generate

Generating each token requires loading the full model weight set from VRAM into Tensor Cores. Memory bandwidth — a key LLM GPU requirement — is the hard ceiling on tok/s, not TFLOPS. A100-40G at 1,555 GB/s on a 20B FP16 model: theoretical max ≈ 38 tok/s for a single request.

Workload Profile	Primary VRAM Usage	Bottleneck	Recommended Strategy
Low concurrency / short context	Weights dominant	Memory bandwidth	High-bandwidth GPU: H100, RTX 5090
High concurrency / long context	KV Cache dominant	Compute (Tensor Core queue)	Large-VRAM GPU + batching
Offline batch processing	Weights + large-batch KV Cache	Bandwidth & compute both	H100 / A100 + vLLM continuous batching

3 — Tensor Core Precision: Where Blackwell Pulls Ahead

On GPUs with native FP4/FP8 support: FP4 compute is 2× FP8, and FP8 is 2× FP16. FP8-quantized model weights also use half the VRAM of FP16, and FP4 uses a quarter — freeing more VRAM for context and concurrency. Only Blackwell GPUs can run LLMs at FP4 precision, achieving maximum possible throughput.

Precision	RTX Pro 6000 (Blackwell)	H100-80G	A100-80G	Notes
TF32	234 TFLOPS	—	312 TFLOPS	Training common precision
FP16 / BF16	1,000 TFLOPS	989 TFLOPS	312 TFLOPS	Main inference precision
INT8 / FP8	2,000 TFLOPS	~1,979 TFLOPS	No native FP8	FP8: 2× throughput, half VRAM vs FP16
FP4 / INT4	4,000 TFLOPS	Not supported	Not supported	Blackwell-exclusive: 4× vs FP16, quarter VRAM

Best GPUs for LLM Workloads

Finding the best GPU for LLM workloads depends on model size, concurrency, and budget. All specs validated on GPU Mart LLM hosting infrastructure. Source: NVIDIA official specification documents. Pricing: gpu-mart.com/pricing.

Tensor Cores Dense: FP16 dense compute (TFLOPS). AI TOPS: max compute at lowest supported precision (sparse). Precision: natively supported precisions.

GPU Model	VRAM	Mem BW	Tensor Cores Dense (FP16)	AI TOPS (max prec.)	Precision	Max Model	Typical Use Case	Price / mo	Order
Data Center — Volta
V100-SXM2-16G	16 GB	900 GB/s	21.2 TFLOPS	1,248 TOPS	FP16 INT8	~7B FP16	Legacy production, FP16 inference	$131.56 56% OFF	Order Now
Data Center — Ampere
A40-48G	48 GB	696 GB/s	149.7 TFLOPS	—	FP16/BF16 INT8/INT4	~27B FP8	Offline batch, doc analysis	$296 dedicated	Order Now
A100 - SXM4 - 40GB	40 GB	1,555 GB/s	312 TFLOPS	2,496 TOPS	FP16/BF16 INT8/TF32	~14B FP16	Mid-size inference/training	$360 55% OFF	Order Now
A100 - 80GB - PCIe	80 GB	1,935 GB/s	312 TFLOPS	2,496 TOPS	FP16/BF16 INT8/TF32	~40B FP16	Large model production	$1,559 dedicated	Order Now
Data Center — Hopper
H100 - 80G - PCIe	80 GB	2,000 GB/s	756 TFLOPS	2,026 TOPS	FP16/BF16 FP8/INT8	~80B FP8	High-concurrency production API	$2,099 dedicated	Order Now
Professional Workstation — Ampere
RTX A4000-16G	16 GB	448 GB/s	76.7 TFLOPS	153.4 TOPS	FP16 INT8	~7B FP16 / 14B INT4	Dev/test, single-user	$120 VPS · $140 ded.	Order Now
RTX A5000-24G	24 GB	768 GB/s	111.1 TFLOPS	222.2 TOPS	FP16 INT8	~13B FP16 / 27B INT4	Mid-model dev/test	$269 dedicated	Order Now
RTX A6000-48G	48 GB	768 GB/s	154.9 TFLOPS	309.7 TOPS	FP16 INT8	~24B FP16 / 48B INT4	Mid-large private deploy	$409 dedicated	Order Now
Consumer — Ada Lovelace
RTX 4090-24G	24 GB	1,008 GB/s	165 TFLOPS	1,321 TOPS	FP16/BF16 FP8/INT4/TF32	~13B FP16 / 27B INT4	High-speed small model	$409 dedicated	Order Now
Consumer — Blackwell
RTX 5090-32G	32 GB	1,792 GB/s	419 TFLOPS	3,352 TOPS	FP16/BF16 FP8/FP4/INT4	~35B INT4	Fast inference, small team	$399 VPS · $479 ded.	Order Now
Professional — Blackwell (Recommended)
ⓘ GPU Mart GPU VPS uses KVM PCI GPU Passthrough — exclusive GPU, no shared resources
RTX Pro 2000-16G	16 GB	288 GB/s	136 TFLOPS	545 TOPS	FP16/BF16 FP8/FP4/INT4	~7B FP16	Lightweight, single user	$99 VPS	Order Now
RTX Pro 4000-24G	24 GB	672 GB/s	147 TFLOPS	1,178 TOPS	FP16/BF16 FP8/FP4	~27B INT4	Small-team mid model	$159 VPS	Order Now
RTX Pro 5000-48G	48 GB	1,344 GB/s	268 TFLOPS	—	FP16/BF16 FP8/FP4	~35B INT4 (128K)	Agent, RAG, multi-model	$269 VPS	Order Now
RTX Pro 6000-96G	96 GB	1,597 GB/s	500 TFLOPS	4,000 TOPS	FP16/BF16 FP8/FP4 INT4	~122B INT4	Large model single-card	$479 VPS	Order Now

Quick-Decision: Best GPU for LLM by Use Case

Your Scenario	VRAM	Recommended GPU	Reason	Monthly Price
Personal dev / testing, 7B–14B	16–24 GB	RTX A4000 / Pro 2000 / Pro 4000	Low-cost entry; Pro 2000 VPS $95/mo, Pro 4000 VPS $159/mo	from $95/mo
Small-team production, 14B–27B	24–48 GB	RTX A6000 / Pro 5000-48G / A40	48GB large VRAM; A6000 $409/mo, Pro 5000 $269/mo, A40 $296/mo	from $269/mo
High-speed inference, 7B–14B high concurrency	24–32 GB	RTX 5090 / RTX Pro 5000	Blackwell extreme bandwidth; 5090 VPS $399/mo	from $399/mo
Enterprise RAG / Agent, 27B–35B	48 GB	RTX Pro 5000-48G / A100-40G	Pro 5000 VPS $269/mo; A100-40G $360/mo (55% OFF)	from $269/mo
70B–120B quantized, single card	80–96 GB	A100-80G / H100 / Pro 6000-96G	A100-80G $1,559/mo; H100 $2,099/mo; Pro 6000 VPS $479/mo	from $479/mo
100+ concurrent users production API	80 GB+	H100-80G	Hopper + FP8, industry standard; $2,099/mo	$2,099/mo

All plans: flat-rate monthly billing, unmetered bandwidth, no hidden fees. Pricing subject to change — verify at gpu-mart.com/pricing.

Real Inference Benchmark Data

vLLM framework · Input 1,024 tokens + Output 512 tokens · GPU Mart production hardware · Optimized for text generation and LLM inference

Mean TTFT = Time to First Token (lower is better) · P50 TTFT = median first-token latency · Mean E2EL = end-to-end latency
Per-user Output Tokens/s: Average output token generation speed per request under the specified concurrency level. Reflects single-stream generation performance in a multi-user serving environment.
Aggregate Output Tokens/s: Total output token generation rate across all concurrent requests. Measures overall serving capacity excluding input tokens.

Most common enterprise deployment model · All concurrency levels shown

GPU	Concurrency	Mean TTFT (s)	P50 TTFT (s)	Per-user Output Tokens/s	Aggregate Output Tokens/s	Mean E2EL (s)
A40-48G	1	1.722	1.777	5.16	5.16	99.15
A40-48G	8	6.601	6.186	4.88	39.06	95.03
A40-48G	32	12.758	10.904	3.05	97.52	158.43
A100-80G	1	0.630	0.647	20.51	20.51	24.96
A100-80G	8	1.060	0.526	20.67	165.36	23.61
A100-80G	32	2.612	1.060	11.01	352.46	43.34
A6000-48G	1	0.271	0.288	23.15	23.15	22.12
A6000-48G	8	0.947	0.919	19.35	154.78	25.54
A6000-48G	32	2.227	1.687	12.70	406.43	37.41
H100-80G	1	0.199	0.236	40.07	40.07	12.78
H100-80G	8	0.954	0.710	34.81	278.48	14.70
H100-80G	32	1.086	0.370	24.26	776.46	19.36
RTX 5090-32G	1	0.164	0.183	40.10	40.10	12.77
RTX 5090-32G	8	0.571	0.549	31.98	255.84	15.45
RTX 5090-32G	32	1.044	0.394	22.20	710.53	19.25
RTX Pro 5000-48G	1	0.164	0.183	40.55	40.10	12.77
RTX Pro 5000-48G	8	0.571	0.549	34.34	255.84	15.45
RTX Pro 5000-48G	32	1.044	0.394	28.07	710.53	19.25
RTX Pro 6000-96G	1	1.103	0.141	41.7	41.7	12.272
RTX Pro 6000-96G	8	0.395	0.239	40.1	320.6	12.302
RTX Pro 6000-96G	32	1.099	0.377	27.2	871.0	17.435

For 14B FP16 models, the best GPUs for LLM inference are H100, RTX 5090, and RTX Pro 5000 — all achieving ~40 tok/s single-user. A40 bottlenecks at bandwidth (~5 tok/s). A6000-48G delivers best price-to-throughput for production LLM hosting: $409/mo for 406 tok/s at 32-concurrency.

Blackwell FP8 native acceleration vs Ampere · All concurrency levels shown

GPU	Concurrency	Mean TTFT (s)	P50 TTFT (s)	Per-user Output Tokens/s	Aggregate Output Tokens/s	Mean E2EL (s)
A40-48G	1	0.942	1.083	25.06	25.06	20.43
A40-48G	8	1.744	1.232	15.74	125.93	31.52
A40-48G	32	5.127	1.482	6.98	223.20	70.00
A6000-48G	1	0.225	0.208	69.42	69.42	7.37
A6000-48G	8	0.420	0.254	54.70	437.62	9.03
A6000-48G	32	0.768	0.424	31.19	997.96	15.45
H100-80G	1	0.120	0.103	122.22	122.22	4.19
H100-80G	8	0.214	0.150	87.04	696.33	4.19
H100-80G	32	0.554	0.348	45.96	1,470.83	4.19
RTX 5090-32G	1	0.095	0.080	144.41	144.41	3.55
RTX 5090-32G	8	0.164	0.121	126.16	1,009.26	3.91
RTX 5090-32G	32	0.404	0.341	90.30	2,889.66	5.21
RTX Pro 5000-48G	1	0.106	0.087	118.62	116.01	4.41
RTX Pro 5000-48G	8	0.175	0.111	108.11	804.73	4.90
RTX Pro 5000-48G	32	0.397	0.279	81.63	2,269.42	6.66
RTX Pro 6000-96G	1	0.034	0.033	133.8	133.8	3.826
RTX Pro 6000-96G	8	0.061	0.050	120.1	960.5	4.107
RTX Pro 6000-96G	32	0.777	0.819	78.2	2,502.7	6.021

For 8B-FP8 models, the best GPU for LLM inference is the RTX 5090 at 144 tok/s — nearly matching H100 (122 tok/s) at less than 1/5 the cost. RTX Pro 5000 achieves 118 tok/s with 48 GB VRAM. A6000 manages only 69 tok/s — this is Blackwell FP8 native advantage in action.

A100 vs H100 on mid-large models · FP8 quantization

GPU	Concurrency	Mean TTFT (s)	P50 TTFT (s)	Per-user Output Tokens/s	Aggregate Output Tokens/s	Mean E2EL (s)
A100-80G	1	1.366	1.323	15.75	15.75	32.50
A100-80G	8	4.281	3.687	13.22	105.76	37.54
A100-80G	32	7.480	7.687	7.10	227.11	69.36
H100-80G	1	0.347	0.308	37.79	37.79	13.55
H100-80G	8	1.438	1.520	32.16	257.26	15.27
H100-80G	32	2.914	2.892	15.61	499.39	30.55
RTX Pro 6000-96G	1	0.266	0.169	45.8	45.8	11.183
RTX Pro 6000-96G	8	1.911	2.269	35.3	282.7	13.962
RTX Pro 6000-96G	32	4.255	4.345	21.6	692.6	22.094

H100 is 2.4× faster than A100 on 27B-FP8 (37.79 vs 15.75 tok/s). A100 hits 7.5-second TTFT at 32 concurrency — unacceptable for real-time API. For 27B+ FP8 models in production, H100 is the only correct single-GPU choice.

H100-80G only · High-concurrency limits test

GPU	Concurrency	Mean TTFT (s)	P50 TTFT (s)	Per-user Output Tokens/s	Aggregate Output Tokens/s	Mean TPOT (ms)	Mean E2EL (s)
H100-80G	1	1.077	1.258	15.18	15.18	63.90	33.73
H100-80G	8	4.945	4.822	11.91	95.24	71.23	41.35
H100-80G	32	70.823	78.252	4.10	131.25	86.11	114.83
RTX Pro 6000-96G	1	0.349	0.331	21.2	21.2	—	24.163
RTX Pro 6000-96G	8	2.801	3.423	17.1	137.2	—	28.816
RTX Pro 6000-96G	32	17.880	6.921	8.7	276.8	—	56.087

At concurrency 32, TTFT spikes to 70 seconds — 31B model on a single H100 hits severe queue buildup above 8 concurrent requests. For production, cap concurrency at 4–8 per card, or use multi-GPU deployment.

Ready to run these benchmarks on your own workload?

GPU VPS from $21/mo · Dedicated GPU server from $49/mo · No long-term commitment required

View All GPU Plans

Ollama Single-User Benchmarks

llama.cpp backend · Q4_K_M quantization · Ideal for LLM VPS hosting in single-user and dev environments · No KV Cache pre-allocation · Input 1,024 tokens + Output 512 tokens · Single user · Average of 10 requests

VRAM Usage & Maximum Supported Models

GPU (VRAM)	Model	VRAM Used	Context	Notes
RTX Pro 6000 (96 GB) — 120B-class models
RTX Pro 6000	qwen3.5:122b	95 GB	262,144 (256K)	Highest VRAM usage
RTX Pro 6000	gpt-oss:120b	70 GB	131,072 (128K)
RTX Pro 6000	qwen3-coder-next:latest	61 GB	262,144 (256K)
RTX Pro 5000 (48 GB), A6000, A40 — 35B primary workloads
RTX Pro 5000	glm-4.7-flash:latest	40 GB	202,752 (~200K)
RTX Pro 5000	qwen3.5:35b	34 GB	262,144 (256K)	35B — full 256K context
RTX 5090 (32 GB) — 35B with context reduction
RTX 5090	gemma3:27b	30 GB	131,072 (128K)
RTX 5090	qwen3.5:35b	30 GB	131,072 (128K)	35B — 128K context
RTX 5090	qwen3.5:35b	27 GB	32,768 (32K)	35B — 32K context
RTX 5090	gemma4:31b	27 GB	32,768 (32K)
RTX 5090	qwen3.5:35b	26 GB	16,384 / 8,192	35B — 16K / 8K context
RTX Pro 4000 (24 GB), A5000, RTX 4090
RTX Pro 4000	qwen3.6:27b	24 GB	32,768 (32K)
RTX Pro 4000	gemma4:26b	20 GB	32,768 (32K)
RTX Pro 4000	deepseek-v2:16b	19 GB	32,768 (32K)
RTX Pro 4000	qwen3.5:4b	17 GB	262,144 (256K)	Small param, ultra-long context
RTX Pro 2000 (16 GB), A4000, P100, V100
RTX Pro 2000	gpt-oss:20b	14 GB	32,768 (32K)
RTX Pro 2000	qwen3.5:9b	9.9 GB	32,768 (32K)	Lowest VRAM usage

Context reduction tip: set num_ctx to reduce VRAM and run 35B models on 32 GB cards:

curl http://localhost:11434/api/generate -d '{"model":"qwen3.5:35b","prompt":"hello","options":{"num_ctx":32768}}'

Generation Speed by GPU

gemma4:26b — 20 GB, 32K context

GPU	Avg TTFT (s) ↓	P50 TTFT (s)	Avg Gen Speed tok/s ↑	Avg E2E Time (s) ↓
RTX 5090	4.933	4.917	149.77	4.93
RTX Pro 6000	4.965	4.897	140.41	4.97
RTX Pro 5000	5.045	5.006	136.79	5.05
RTX Pro 4000	6.114	6.079	107.56	6.11
A6000	6.482	6.438	102.30	6.48
A5000	6.830	6.800	93.91	6.83
A40	7.048	7.008	92.24	7.05

gpt-oss:20b — 14 GB, 32K context

GPU	Avg TTFT (s) ↓	P50 TTFT (s)	Avg Gen Speed tok/s ↑	Avg E2E Time (s) ↓
RTX 5090	0.653	0.619	214.90	3.67
RTX Pro 6000	0.556	0.558	202.25	3.62
RTX Pro 5000	0.613	0.597	178.84	3.98
A6000	0.642	0.638	124.66	5.28
RTX Pro 4000	0.553	0.555	117.60	5.37
A5000	0.664	0.620	109.00	5.85
A40	0.646	0.645	96.45	6.60
RTX Pro 2000	0.541	0.532	61.69	9.24

qwen3.5:9b — 9.9 GB, 32K context

GPU	Avg TTFT (s) ↓	P50 TTFT (s)	Avg Gen Speed tok/s ↑	Avg E2E Time (s) ↓
RTX 5090	0.618	0.597	140.45	4.97
RTX Pro 6000	0.597	0.589	130.04	5.16
RTX Pro 5000	0.576	0.579	123.13	5.31
A6000	0.747	0.732	80.95	7.73
A40	0.747	0.732	80.95	7.73
RTX Pro 4000	0.600	0.603	78.59	7.75
A5000	0.757	0.738	70.50	8.60
RTX Pro 2000	0.746	0.749	42.13	13.45

RTX Pro 5000 (48G Blackwell) achieves 178 tok/s on gpt-oss:20b and 123 tok/s on qwen3.5:9b — approaching RTX 5090 while offering 48 GB vs 32 GB VRAM. Best overall value for single-user LLM hosting when both speed and model capacity matter.

Inference Framework Selection

Framework choice affects throughput and latency as much as GPU selection — pick the right one for your LLM inference use case.

Dimension	Ollama	vLLM	SGLang	TensorRT-LLM
Design goal	Local single-user	Production high-concurrency API	High-throughput structured inference	Max throughput (NVIDIA only)
Deployment complexity	Simplest	Medium	Medium	Very high (requires compilation)
Cold start time	Seconds	~62 sec	~58 sec	~28 min
Single-user TTFT	~65 ms	~10.7 ms	~11–12 ms	~10.5 ms
High-concurrency throughput	Low (~484 tok/s)	High	Higher (+17–29% vs vLLM)	Highest
VRAM usage	Low (INT4, no pre-alloc)	High (pre-allocated)	High (pre-allocated)	High
FP8/FP4 support	Partial	Full	Full	Full
OpenAI-compatible API	Yes	Yes	Yes	Yes
GPU support	NVIDIA + Apple M	NVIDIA + AMD + TPU	NVIDIA + AMD	NVIDIA only

Ollama

Dev / Test

One-command setup, model auto-download. Best for dev, single-user text generation, rapid evaluation.

vLLM

Production First Choice

General production API, widest model compatibility (incl. AMD/TPU). Continuous batching default on.

SGLang

RAG / Agent

RadixAttention prefix caching delivers +29% text generation throughput vs vLLM. Ideal for RAG, multi-turn, DeepSeek-class models.

TensorRT-LLM

Advanced Only

Only if: max throughput on pure NVIDIA stack AND you accept 28-min compilation per model version.

Deployment Optimization Tips

Reduce VRAM (Fix OOM)

Enable vLLM continuous batching (default on): dynamically merges requests, lowers peak VRAM
Quantize to INT4/INT8 via AWQ/GPTQ: 2–4× VRAM reduction, minimal quality loss
Reduce max_tokens and batch_size: cuts peak KV Cache usage
In Ollama: set num_ctx explicitly to allocate only what you need
Last resort: CPU offloading (latency penalty; only for extreme VRAM shortage)

Improve Throughput & Speed

Choose high-bandwidth GPUs: bandwidth caps tok/s. H100 > RTX 5090 > A100 > A6000
Use FP8 models + FP8-capable GPU: doubles throughput at same VRAM — also reduces weight memory footprint so more context and concurrency fits
Enable Speculative Decoding: small draft model assists large model, reduces TPOT
Multi-GPU tensor parallelism: vLLM/SGLang support --tensor-parallel-size

Auxiliary Models for <16 GB GPUs

Model Type	Example Model	Typical VRAM	Use Case
Embedding	Qwen3-Embedding-8B	~10 GB	RAG vector encoding
Reranker	bge-reranker-large	~1.7 GB	Retrieval result reranking
ASR	Whisper / Wav2Vec	2–6 GB	Speech-to-text transcription
VLM (Vision, small)	MedGemma-4B	~4 GB	Multimodal perception

Real Customer Deployments

Three detailed case studies + 13 industry deployment records. Data partially anonymized.

Case 1 — AI Application Company: Private Agent Platform

AI software company running Agent systems for code generation, document processing, and automated tasks 24/7. Previously on cloud API token billing — costs unpredictable, rate limits triggered at peak. Local GPU breaks even vs API cost in 1–2 months.

Config: RTX A6000 (48 GB) dedicated · Dual E5-2697v4 CPUs · 256 GB DDR4 · vLLM · Qwen3.5-27B GPTQ-Int4

First-token <500 ms 40–50 tok/s 168h continuous — zero failures $283/mo · 65% savings vs cloud API · ROI 180–430%

Case 2 — Enterprise: Knowledge Base RAG

Large enterprise with millions of documents. Internal Q&A, customer service assist. Data cannot leave the internal network.

Stack: vLLM + gpt-oss-20b (29.6 GB) + Qwen3-Embedding-8B (10.3 GB) + bge-reranker-large (1.7 GB) + Weaviate / FAISS + DIFY workflow.

Avg response 1.5–2.5 sec P99 <3.5 sec 5–8 concurrent threads 50–70% cost reduction · ROI 150–300%

Case 3 — Dev Team: Local Coding Assistant

Enterprise AI R&D team, high-frequency code generation for multiple developers. Previously on Claude/GPT APIs — code data leaving company.

Config: RTX Pro 6000 (96 GB) dedicated · 32-core CPU · 84 GB RAM · 1 Gbps unmetered · vLLM · GLM-4.7-Flash

Avg response 1–2 sec P99 <3 sec All code processed locally 50–70% cost reduction · ROI 150–300%

Additional Industry Deployment Records

Customer Type	Core Need	GPU	Model Architecture	Notes
Medical AI vendor	Multimodal clinical note generation	RTX A6000	Whisper + vision model + LLM	Medical-grade privacy
AI medical team	Image + text joint reasoning	RTX A6000	MedGemma-4B-it multimodal	Multimodal medical scene
AI application company	Chat + memory + image + image gen Agent	RTX A6000	LLM + Embedding + VLM + ComfyUI	Multi-model collaborative
Financial firm	Time-series trading + RL + risk control	RTX A6000	Transformer + RL model + FinBERT	Real-time, low-latency
Creative / AI team	Image generation workflow	RTX 5090	ComfyUI + Stable Diffusion multi-model	Blackwell bandwidth advantage
Law firm	Contract OCR + semantic search	RTX 5090	LLM + OCR + Embedding	Document privacy
Voice AI team	Stable ASR service	RTX Pro 5000	Whisper / Wav2Vec	Low power, long-running
Enterprise AI team	Knowledge base Q&A	RTX Pro 5000	Embedding + LLM + RAG	Knowledge stays on-prem
Sports data (BeSoccer)	Multi-model parallel content gen	RTX Pro 5000	Qwen3-8B-Q4 + Gemma-3-12B Q4	Multiple models simultaneously
AI content platform	Text gen + image gen multi-task	RTX Pro 5000	LLM + ComfyUI; Docker multi-container	Isolated deployment
AI application company	Dialogue + TTS voice interaction	RTX Pro 5000	Qwen3.5-35B-AWQ-4bit + CosyVoice TRT	Gunicorn + Uvicorn
AI R&D team	Text + image + voice multimodal hub	RTX Pro 5000	Qwen3.6-35B + ComfyUI + Whisper	Docker multi-container
AI application team	Voice + vision + text multimodal	RTX Pro 5000	Qwen3.5-27B-VLM-INT4 (vLLM)	Voice + vision input

Used by medical AI, fintech, law firms, and 25,000+ GPU server deployments.

SOC-certified U.S. data centers · 99.9% SLA · <5 min support response

Get Started

Who This Is (and Isn't) For

Choosing the right GPU for self-hosted LLM workloads on dedicated infrastructure is not right for every team. Here is the honest breakdown.

Good Fit

Monthly spend on cloud LLM APIs already exceeds $300 or expected to grow — switching to LLM hosting on a dedicated GPU server or GPU VPS reduces long-term cost significantly
Workloads involve patient data, legal contracts, financial reports, or proprietary code — data privacy is non-negotiable
Need 24/7 always-on inference without cold-start latency or random resource preemption
Building RAG pipelines, AI Agents, or multi-turn applications
Have basic Linux ops: SSH access, able to deploy vLLM or Ollama

Not a Good Fit

Only need a few hours of GPU time for experiments — pay-per-hour cloud makes more sense (note: Vast.ai uses third-party hosts; documented cases of instances terminated without notice)
Need thousand-GPU InfiniBand clusters for distributed hyperscale training — consider Lambda Labs or CoreWeave

Frequently Asked Questions

What is the difference between GPU VPS and a dedicated GPU server?: GPU VPS uses PCIe Passthrough to give you exclusive, non-shared access to physical GPU hardware — near bare-metal performance at lower cost. Suitable for most 7B–70B LLM server workloads. A dedicated GPU server gives you the entire physical machine exclusively: right for production AI server deployments, multi-GPU inference, training, and zero-tolerance performance workloads.
Which inference frameworks come pre-installed, and how do I enable them?: NVIDIA drivers are pre-installed on all plans. At deploy time, select from 20+ pre-configured AI frameworks including Ollama, ComfyUI, Qwen3, and Gemma3. One-click deployment from the control panel under All Products → App.
My LLM inference service is hitting OOM. What can I do without switching hardware?: In priority order: (1) Verify vLLM continuous batching is on (default); (2) quantize to INT4/INT8 via AWQ or GPTQ — 2–4× VRAM reduction; (3) reduce batch_size and max_tokens; (4) in Ollama set num_ctx explicitly; (5) last resort: CPU offloading — significant latency penalty.
How do I calculate the VRAM requirements for my LLM?: LLM GPU requirements for VRAM: Total ≈ model weights + KV Cache + 20–30% headroom. Model weights = param count (B) × precision bytes (FP16=2, INT8=1, INT4=0.5). KV Cache = 2 × layers × KV heads × head dims × precision bytes × context length × concurrency / 1e9 (GB). Example: 7B FP16 → 14 GB; same at INT4 → ~3.5 GB.
Should I use a single large-VRAM GPU or multiple smaller cards for LLM hosting?: Always prefer single large-VRAM where the model fits (A6000 48G, A100 80G, Pro 6000 96G): lowest latency, no inter-GPU communication overhead, simplest deployment. Only use multi-GPU tensor parallelism (--tensor-parallel-size 2+) when the model literally cannot fit on one card.
vLLM vs SGLang — which inference framework should I choose?: Choose vLLM for general production API, widest model compatibility (AMD/TPU/Trainium), mixed workloads. Choose SGLang for RAG, multi-turn dialogue, AI Agents (RadixAttention +29% vs vLLM), DeepSeek-class reasoning models.
How do I design rate limiting and concurrency control for a production LLM API?: Per-user RPM/TPM limits; max 1–3 concurrent requests per user; queue excess requests rather than rejecting; set max input/output tokens. Best practice: API gateway (Kong/Nginx) for auth + rate limiting, vLLM backend for batching and queuing.
What is the GPU hosting pricing model and how do I estimate monthly cost?: Transparent monthly billing with 1/3/12/24-month options. VPS: RTX Pro 2000 (16G) $95.20/mo (20% OFF); RTX Pro 4000 (24G) $159/mo (20% OFF); RTX Pro 5000 (48G) $269/mo; RTX 5090 (32G) $399/mo; RTX Pro 6000 (96G) $479/mo. Dedicated: A6000 $409/mo; A100-40G $360/mo (55% OFF); A100-80G $1,559/mo; H100-80G $2,099/mo. No hidden fees. See gpu-mart.com/pricing.
How do I deploy Hugging Face models or private models on a GPU server?: CUDA and drivers are pre-installed. Private models upload directly from local; HuggingFace models via git clone or huggingface-cli. Launch your self hosted LLM inference service with vLLM/TGI/Ollama (OpenAI-compatible API); expose REST API via port. Most users complete deployment in minutes.
Is OpenAI-compatible API supported? How do I migrate existing code to a self-hosted LLM?: vLLM/SGLang/Ollama all expose OpenAI-format compatible endpoints on your GPU server. Migration requires only changing base_url and api_key — no business logic changes. Most teams complete the switch in under 5 minutes.
Can I upgrade my GPU VPS or dedicated server configuration later?: You can upgrade to a higher GPU VPS tier or dedicated server at any time. Adding extra GPUs to the same server after deployment is not supported — select a multi-GPU plan at initial deployment.
Is there a free trial available for GPU hosting?: GPU Mart offers hourly pay-as-you-go billing for quick testing. A 24-hour free trial is also available — verify your actual workload before committing to a monthly plan.
Can I self-host an LLM on a GPU VPS?: Yes. GPU Mart LLM VPS hosting uses KVM PCI GPU Passthrough, giving you exclusive access to physical GPU hardware — the same GPU performance as a dedicated server at a lower price. You get root access to install vLLM, Ollama, or any inference framework, and can expose an OpenAI-compatible API endpoint. VPS plans start from $95/mo with 16 GB VRAM, suitable for running 7B–14B parameter models. For larger models (27B–70B+), 48–96 GB VRAM VPS options are available from $269/mo.
How much VRAM is required to self-host a 70B LLM?: VRAM requirements for a 70B model depend on quantization: FP16 (full precision) requires ~140 GB — beyond a single card; INT8 quantization requires ~70 GB, fitting on an A100-80G or H100-80G; INT4/AWQ/GPTQ quantization reduces this to ~35 GB, runnable on an RTX Pro 5000-48G or A6000-48G with two cards. For most production use cases, INT4 quantization of a 70B model on a single 48 GB GPU delivers good quality with practical latency. If quality loss from quantization is unacceptable, use two 48 GB GPUs with tensor parallelism.
What is the best GPU for self-hosted LLMs?: The best GPU for self-hosted LLM workloads depends on model size and budget. For 7B–14B models: RTX Pro 4000 (24G, $159/mo) or RTX 5090 (32G, $399/mo) for speed. For 27B–35B models: RTX Pro 5000 (48G, $269/mo) is the best value — 48 GB VRAM, Blackwell FP8/FP4 native, 178 tok/s on Ollama. For 70B+ models: A100-80G ($1,559/mo) or H100-80G ($2,099/mo). For maximum price-to-performance on FP8 models: RTX 5090 hits 144 tok/s on 8B-FP8 at $399/mo, comparable to H100 at 1/5 the cost.
Is self-hosting an LLM cheaper than using a cloud API?: For high-volume use cases, yes — significantly. A team calling GPT-4o at $5/M input + $15/M output tokens, running 2M tokens/day, spends ~$20/day or $600/month. A GPU VPS at $269/mo (RTX Pro 5000, 48G) handles the same workload with no per-token cost. Break-even is typically 1–2 months, with 50–70% long-term savings. For very low-volume usage (under 500K tokens/day), API pricing may still be more economical than maintaining dedicated GPU infrastructure.
What GPU do I need to run a 32B or 70B parameter LLM?: For a 32B model: INT4 quantization (AWQ/GPTQ) requires ~16–18 GB VRAM — fits on an RTX Pro 4000 (24G) or RTX 5090 (32G). FP16 requires ~64 GB — needs an A100-80G or two 48G cards. For a 70B model: INT4 quantization requires ~35–40 GB — fits on an RTX Pro 5000 (48G) or A6000 (48G). FP16 requires ~140 GB — requires two A100-80G or H100-80G cards with tensor parallelism. Recommended for most teams: run 32B at INT4 on RTX Pro 5000 ($269/mo) or 70B at INT4 on two A6000s ($818/mo combined).

Not sure which GPU fits your model? Talk to an expert.

Free 24-hour trial · Flat-rate billing · AI training server and inference server configs available

Get Free Expert Consultation

Self-Hosted LLMs: GPU Selection, Benchmarks, VRAM Requirements & Hosting Guide