Self-Hosted LLMs: GPU Selection, Benchmarks, VRAM Requirements & Hosting Guide

Compare GPU performance, VRAM requirements, and hosting options for self-hosted LLMs. Explore benchmark data across 14 GPU configurations and learn which hardware is best for running 7B–70B AI models in production.

By GPU Mart Technical Team

Self-hosted LLMs allow organizations to run AI models on dedicated GPU infrastructure instead of relying on third-party APIs. Compared with token-based AI services, self-hosting offers greater control over text generation performance, data privacy, compliance, and long-term operating costs.

This guide compares GPU requirements, VRAM sizing, benchmark results across 14 GPU configurations, deployment frameworks, and LLM hosting architectures for running 7B–70B language models in production. Whether you are evaluating GPU VPS or dedicated GPU server options, the data here covers LLM GPU requirements, real inference benchmarks, and hosting cost comparisons to support your decision.

Why Teams Switch Away from Cloud AI APIs

Three pain points that surface once your self-hosted LLM text generation workloads move from prototype to production at scale.

Unpredictable Costs

AWS / GCP / OpenAI charge per token — the more you use, the higher the bill. High-traffic peaks trigger rate limits exactly when your users need the service most. There is no ceiling.

Unpredictable Latency

Shared GPU resources cause latency spikes you cannot control or predict. Production SLAs become impossible to guarantee when your AI server competes with thousands of other tenants on shared infrastructure.

Data Privacy Risk

Every prompt sent to a third-party API — customer data, proprietary code, internal knowledge bases — passes through infrastructure you do not control. For regulated industries, this is a hard blocker.

Why Self-Host? The Four Deployment Paths Compared

Understanding structural limits matters more than comparing spec sheets — especially for production LLM inference at scale.

Deployment PathExamplesKey AdvantageKey LimitationBest For
Cloud APIOpenAI / Anthropic / GeminiFast setup, no ops overheadData leaves your env; token costs scale; rate limits at peakPrototyping, very low volume
Model MarketplaceTogether AI / FireworksWide model selectionShared resources, unstable perf, limited customizationMid-volume testing
On-Prem Data CenterPrivate rackFull control, data sovereignty$1M+ upfront, complex opsHyperscale enterprises only
Dedicated GPU Server (Recommended)GPU MartStable + fixed cost + private data + model flexibilityRequires basic Linux opsSMB production, AI startups

GPU Mart Dedicated GPU: Four Core Advantages

Exclusive GPU Resources

No resource preemption, no performance jitter. 100% compute is yours. No noisy-neighbor interference unlike shared cloud pools.

100% Data Private

Data never leaves your LLM server. Meets enterprise compliance and GDPR. No third-party API exposure for sensitive prompts, code, or documents.

30–70% Lower Latency

A dedicated AI server eliminates shared-GPU latency spikes. First-token response is consistently fast — no cold-start delays, no queue contention.

50%+ Lower Long-Term Cost

LLM hosting on flat-rate dedicated hardware replaces per-token billing. LLM VPS hosting starts from $95/mo — predictable budgets, zero surprise invoices.

30–70%Lower latency vs shared cloud
50%+Long-term cost reduction
99.9%Uptime SLA
<5 minSupport response, 24/7

What Hardware Determines LLM Performance?

Understanding LLM GPU requirements — VRAM, memory bandwidth, and precision support — is essential before selecting the best GPU for LLM hosting. These three variables determine what model sizes you can run and how fast tokens generate.

1 — VRAM: LLM GPU Requirements by Model Size

VRAM is the top LLM GPU requirement — the hard ceiling for any self-hosted LLM. If the model doesn't fit in VRAM, no other spec matters. The VRAM requirements below use a 14B parameter model as reference — scale proportionally for larger models.

Note on Q4_K_M (Ollama default): to avoid the significant quality degradation of pure 4-bit quantization, Ollama uses a "mixed precision" approach — core weights at 6-bit, less critical weights at 4-bit. This keeps quality loss minimal while keeping a 35B model at approximately 22–23 GB VRAM.

Quantization FormatBytes / Param14B Model VRAMNotes
FP16 / BF16 (full precision)2 bytes~28 GBHighest quality, no precision loss
FP81 byte~14 GBNear-FP16 quality; requires native GPU support
INT81 byte~14 GBSlight quality loss; broad compatibility
Q4_K_M (Ollama GGUF default)~0.55 bytes~7–8 GBMixed precision (6-bit core + 4-bit other); 35B fits ~22–23 GB
INT4 / AWQ / GPTQ0.5 bytes~7 GBHeavy compression; good for constrained setups

KV Cache VRAM Requirements: Context Length & Concurrency

Beyond model weights, inference requires KV Cache memory for every active request. For Qwen2.5-14B at FP16 (24 layers, 8 KV heads, 64 dims per head):

KV Cache / Token = 2 × L × H_kv × D_head × B = 2 × 24 × 8 × 64 × 2 bytes ≈ 48 KB / Token
ConcurrencyContext Length (Tokens)KV Cache VRAMTypical Use Case
11,024~48 MBSingle-user short chat
81,024~384 MBSmall team short chat
321,024~1.5 GBMedium concurrency short chat
132,768 (32K)~1.5 GBSingle-user long document
1131,072 (128K)~6 GBSingle-user extended context

VRAM requirements formula: Total VRAM = Model Weights + KV Cache + 20–30% headroom. Always size for peak concurrency, not just model weights. This is the most common cause of OOM errors in self-hosted LLM deployments.

2 — Memory Bandwidth: How Fast Tokens Generate

Generating each token requires loading the full model weight set from VRAM into Tensor Cores. Memory bandwidth — a key LLM GPU requirement — is the hard ceiling on tok/s, not TFLOPS. A100-40G at 1,555 GB/s on a 20B FP16 model: theoretical max ≈ 38 tok/s for a single request.

Workload ProfilePrimary VRAM UsageBottleneckRecommended Strategy
Low concurrency / short contextWeights dominantMemory bandwidthHigh-bandwidth GPU: H100, RTX 5090
High concurrency / long contextKV Cache dominantCompute (Tensor Core queue)Large-VRAM GPU + batching
Offline batch processingWeights + large-batch KV CacheBandwidth & compute bothH100 / A100 + vLLM continuous batching

3 — Tensor Core Precision: Where Blackwell Pulls Ahead

On GPUs with native FP4/FP8 support: FP4 compute is 2× FP8, and FP8 is 2× FP16. FP8-quantized model weights also use half the VRAM of FP16, and FP4 uses a quarter — freeing more VRAM for context and concurrency. Only Blackwell GPUs can run LLMs at FP4 precision, achieving maximum possible throughput.

PrecisionRTX Pro 6000 (Blackwell)H100-80GA100-80GNotes
TF32234 TFLOPS312 TFLOPSTraining common precision
FP16 / BF161,000 TFLOPS989 TFLOPS312 TFLOPSMain inference precision
INT8 / FP82,000 TFLOPS~1,979 TFLOPSNo native FP8FP8: 2× throughput, half VRAM vs FP16
FP4 / INT44,000 TFLOPSNot supportedNot supportedBlackwell-exclusive: 4× vs FP16, quarter VRAM

Best GPUs for LLM Workloads

Finding the best GPU for LLM workloads depends on model size, concurrency, and budget. All specs validated on GPU Mart LLM hosting infrastructure. Source: NVIDIA official specification documents. Pricing: gpu-mart.com/pricing.

Tensor Cores Dense: FP16 dense compute (TFLOPS).  AI TOPS: max compute at lowest supported precision (sparse).  Precision: natively supported precisions.

GPU Model VRAM Mem BW Tensor Cores
Dense
(FP16)
AI TOPS
(max
prec.)
Precision Max Model Typical Use Case Price / mo Order
Data Center — Volta
V100-SXM2-16G16 GB900 GB/s21.2 TFLOPS1,248 TOPSFP16
INT8
~7B FP16Legacy production, FP16 inference$131.56 56% OFFOrder Now
Data Center — Ampere
A40-48G48 GB696 GB/s149.7 TFLOPSFP16/BF16
INT8/INT4
~27B FP8Offline batch, doc analysis$296 dedicatedOrder Now
A100-40G40 GB1,555 GB/s312 TFLOPS2,496 TOPSFP16/BF16
INT8/TF32
~14B FP16Mid-size inference/training$360 55% OFFOrder Now
A100-80G80 GB2,000 GB/s312 TFLOPS2,496 TOPSFP16/BF16
INT8/TF32
~40B FP16Large model production$1,559 dedicatedOrder Now
Data Center — Hopper
H100-80G80 GB3,350 GB/s989 TFLOPS3,958 TOPSFP16/BF16
FP8/INT8
~80B FP8High-concurrency production API$2,099 dedicatedOrder Now
Professional Workstation — Ampere
RTX A4000-16G16 GB448 GB/s76.7 TFLOPS153.4 TOPSFP16
INT8
~7B FP16 / 14B INT4Dev/test, single-user$120 VPS · $140 ded.Order Now
RTX A5000-24G24 GB768 GB/s111.1 TFLOPS222.2 TOPSFP16
INT8
~13B FP16 / 27B INT4Mid-model dev/test$269 dedicatedOrder Now
RTX A6000-48G48 GB768 GB/s154.9 TFLOPS309.7 TOPSFP16
INT8
~24B FP16 / 48B INT4Mid-large private deploy$409 dedicatedOrder Now
Consumer — Ada Lovelace
RTX 4090-24G24 GB1,008 GB/s165 TFLOPS1,321 TOPSFP16/BF16
FP8/INT4/TF32
~13B FP16 / 27B INT4High-speed small model$409 dedicatedOrder Now
Consumer — Blackwell
RTX 5090-32G32 GB1,792 GB/s419 TFLOPS3,352 TOPSFP16/BF16
FP8/FP4/INT4
~35B INT4Fast inference, small team$399 VPS · $479 ded.Order Now
Professional — Blackwell (Recommended)
ⓘ  GPU Mart GPU VPS uses KVM PCI GPU Passthrough — exclusive GPU, no shared resources
RTX Pro 2000-16G16 GB288 GB/s136 TFLOPS545 TOPSFP16/BF16
FP8/FP4/INT4
~7B FP16Lightweight, single user$99 VPSOrder Now
RTX Pro 4000-24G24 GB672 GB/s147 TFLOPS1,178 TOPSFP16/BF16
FP8/FP4
~27B INT4Small-team mid model$159 VPSOrder Now
RTX Pro 5000-48G48 GB1,344 GB/s268 TFLOPSFP16/BF16
FP8/FP4
~35B INT4 (128K)Agent, RAG, multi-model$269 VPSOrder Now
RTX Pro 6000-96G96 GB1,597 GB/s1,000 TFLOPS4,000 TOPSFP16/BF16
FP8/FP4
INT4
~122B INT4Large model single-card$479 VPSOrder Now

Quick-Decision: Best GPU for LLM by Use Case

Your ScenarioVRAMRecommended GPUReasonMonthly Price
Personal dev / testing, 7B–14B16–24 GBRTX A4000 / Pro 2000 / Pro 4000Low-cost entry; Pro 2000 VPS $95/mo, Pro 4000 VPS $159/mofrom $95/mo
Small-team production, 14B–27B24–48 GBRTX A6000 / Pro 5000-48G / A4048GB large VRAM; A6000 $409/mo, Pro 5000 $269/mo, A40 $296/mofrom $269/mo
High-speed inference, 7B–14B high concurrency24–32 GBRTX 5090 / RTX Pro 5000Blackwell extreme bandwidth; 5090 VPS $399/mofrom $399/mo
Enterprise RAG / Agent, 27B–35B48 GBRTX Pro 5000-48G / A100-40GPro 5000 VPS $269/mo; A100-40G $360/mo (55% OFF)from $269/mo
70B–120B quantized, single card80–96 GBA100-80G / H100 / Pro 6000-96GA100-80G $1,559/mo; H100 $2,099/mo; Pro 6000 VPS $479/mofrom $479/mo
100+ concurrent users production API80 GB+H100-80GHopper + FP8, industry standard; $2,099/mo$2,099/mo

All plans: flat-rate monthly billing, unmetered bandwidth, no hidden fees. Pricing subject to change — verify at gpu-mart.com/pricing.

Real Inference Benchmark Data

vLLM framework · Input 1,024 tokens + Output 512 tokens · GPU Mart production hardware · Optimized for text generation and LLM inference

Mean TTFT = Time to First Token (lower is better)  ·  P50 TTFT = median first-token latency  ·  Mean E2EL = end-to-end latency
Per-user Output Tokens/s: Average output token generation speed per request under the specified concurrency level. Reflects single-stream generation performance in a multi-user serving environment.
Aggregate Output Tokens/s: Total output token generation rate across all concurrent requests. Measures overall serving capacity excluding input tokens.

Most common enterprise deployment model · All concurrency levels shown

GPUConcurrencyMean TTFT (s)P50 TTFT (s)Per-user Output Tokens/sAggregate Output Tokens/sMean E2EL (s)
A40-48G11.7221.7775.165.1699.15
A40-48G86.6016.1864.8839.0695.03
A40-48G3212.75810.9043.0597.52158.43
A100-80G10.6300.64720.5120.5124.96
A100-80G81.0600.52620.67165.3623.61
A100-80G322.6121.06011.01352.4643.34
A6000-48G10.2710.28823.1523.1522.12
A6000-48G80.9470.91919.35154.7825.54
A6000-48G322.2271.68712.70406.4337.41
H100-80G10.1990.23640.0740.0712.78
H100-80G80.9540.71034.81278.4814.70
H100-80G321.0860.37024.26776.4619.36
RTX 5090-32G10.1640.18340.1040.1012.77
RTX 5090-32G80.5710.54931.98255.8415.45
RTX 5090-32G321.0440.39422.20710.5319.25
RTX Pro 5000-48G10.1640.18340.5540.1012.77
RTX Pro 5000-48G80.5710.54934.34255.8415.45
RTX Pro 5000-48G321.0440.39428.07710.5319.25
RTX Pro 6000-96G11.1030.14141.741.712.272
RTX Pro 6000-96G80.3950.23940.1320.612.302
RTX Pro 6000-96G321.0990.37727.2871.017.435

For 14B FP16 models, the best GPUs for LLM inference are H100, RTX 5090, and RTX Pro 5000 — all achieving ~40 tok/s single-user. A40 bottlenecks at bandwidth (~5 tok/s). A6000-48G delivers best price-to-throughput for production LLM hosting: $409/mo for 406 tok/s at 32-concurrency.

Blackwell FP8 native acceleration vs Ampere · All concurrency levels shown

GPUConcurrencyMean TTFT (s)P50 TTFT (s)Per-user Output Tokens/sAggregate Output Tokens/sMean E2EL (s)
A40-48G10.9421.08325.0625.0620.43
A40-48G81.7441.23215.74125.9331.52
A40-48G325.1271.4826.98223.2070.00
A6000-48G10.2250.20869.4269.427.37
A6000-48G80.4200.25454.70437.629.03
A6000-48G320.7680.42431.19997.9615.45
H100-80G10.1200.103122.22122.224.19
H100-80G80.2140.15087.04696.334.19
H100-80G320.5540.34845.961,470.834.19
RTX 5090-32G10.0950.080144.41144.413.55
RTX 5090-32G80.1640.121126.161,009.263.91
RTX 5090-32G320.4040.34190.302,889.665.21
RTX Pro 5000-48G10.1060.087118.62116.014.41
RTX Pro 5000-48G80.1750.111108.11804.734.90
RTX Pro 5000-48G320.3970.27981.632,269.426.66
RTX Pro 6000-96G10.0340.033133.8133.83.826
RTX Pro 6000-96G80.0610.050120.1960.54.107
RTX Pro 6000-96G320.7770.81978.22,502.76.021

For 8B-FP8 models, the best GPU for LLM inference is the RTX 5090 at 144 tok/s — nearly matching H100 (122 tok/s) at less than 1/5 the cost. RTX Pro 5000 achieves 118 tok/s with 48 GB VRAM. A6000 manages only 69 tok/s — this is Blackwell FP8 native advantage in action.

A100 vs H100 on mid-large models · FP8 quantization

GPUConcurrencyMean TTFT (s)P50 TTFT (s)Per-user Output Tokens/sAggregate Output Tokens/sMean E2EL (s)
A100-80G11.3661.32315.7515.7532.50
A100-80G84.2813.68713.22105.7637.54
A100-80G327.4807.6877.10227.1169.36
H100-80G10.3470.30837.7937.7913.55
H100-80G81.4381.52032.16257.2615.27
H100-80G322.9142.89215.61499.3930.55
RTX Pro 6000-96G10.2660.16945.845.811.183
RTX Pro 6000-96G81.9112.26935.3282.713.962
RTX Pro 6000-96G324.2554.34521.6692.622.094

H100 is 2.4× faster than A100 on 27B-FP8 (37.79 vs 15.75 tok/s). A100 hits 7.5-second TTFT at 32 concurrency — unacceptable for real-time API. For 27B+ FP8 models in production, H100 is the only correct single-GPU choice.

H100-80G only · High-concurrency limits test

GPUConcurrencyMean TTFT (s)P50 TTFT (s)Per-user Output Tokens/sAggregate Output Tokens/sMean TPOT (ms)Mean E2EL (s)
H100-80G11.0771.25815.1815.1863.9033.73
H100-80G84.9454.82211.9195.2471.2341.35
H100-80G3270.82378.2524.10131.2586.11114.83
RTX Pro 6000-96G10.3490.33121.221.224.163
RTX Pro 6000-96G82.8013.42317.1137.228.816
RTX Pro 6000-96G3217.8806.9218.7276.856.087

At concurrency 32, TTFT spikes to 70 seconds — 31B model on a single H100 hits severe queue buildup above 8 concurrent requests. For production, cap concurrency at 4–8 per card, or use multi-GPU deployment.

Ready to run these benchmarks on your own workload?

GPU VPS from $21/mo  ·  Dedicated GPU server from $49/mo  ·  No long-term commitment required
View All GPU Plans

Ollama Single-User Benchmarks

llama.cpp backend · Q4_K_M quantization · Ideal for LLM VPS hosting in single-user and dev environments · No KV Cache pre-allocation · Input 1,024 tokens + Output 512 tokens · Single user · Average of 10 requests

VRAM Usage & Maximum Supported Models

GPU (VRAM)ModelVRAM UsedContextNotes
RTX Pro 6000 (96 GB) — 120B-class models
RTX Pro 6000qwen3.5:122b95 GB262,144 (256K)Highest VRAM usage
RTX Pro 6000gpt-oss:120b70 GB131,072 (128K)
RTX Pro 6000qwen3-coder-next:latest61 GB262,144 (256K)
RTX Pro 5000 (48 GB), A6000, A40 — 35B primary workloads
RTX Pro 5000glm-4.7-flash:latest40 GB202,752 (~200K)
RTX Pro 5000qwen3.5:35b34 GB262,144 (256K)35B — full 256K context
RTX 5090 (32 GB) — 35B with context reduction
RTX 5090gemma3:27b30 GB131,072 (128K)
RTX 5090qwen3.5:35b30 GB131,072 (128K)35B — 128K context
RTX 5090qwen3.5:35b27 GB32,768 (32K)35B — 32K context
RTX 5090gemma4:31b27 GB32,768 (32K)
RTX 5090qwen3.5:35b26 GB16,384 / 8,19235B — 16K / 8K context
RTX Pro 4000 (24 GB), A5000, RTX 4090
RTX Pro 4000qwen3.6:27b24 GB32,768 (32K)
RTX Pro 4000gemma4:26b20 GB32,768 (32K)
RTX Pro 4000deepseek-v2:16b19 GB32,768 (32K)
RTX Pro 4000qwen3.5:4b17 GB262,144 (256K)Small param, ultra-long context
RTX Pro 2000 (16 GB), A4000, P100, V100
RTX Pro 2000gpt-oss:20b14 GB32,768 (32K)
RTX Pro 2000qwen3.5:9b9.9 GB32,768 (32K)Lowest VRAM usage

Context reduction tip: set num_ctx to reduce VRAM and run 35B models on 32 GB cards:

curl http://localhost:11434/api/generate -d '{"model":"qwen3.5:35b","prompt":"hello","options":{"num_ctx":32768}}'

Generation Speed by GPU

gemma4:26b — 20 GB, 32K context
GPUAvg TTFT (s) ↓P50 TTFT (s)Avg Gen Speed tok/s ↑Avg E2E Time (s) ↓
RTX 50904.9334.917149.774.93
RTX Pro 60004.9654.897140.414.97
RTX Pro 50005.0455.006136.795.05
RTX Pro 40006.1146.079107.566.11
A60006.4826.438102.306.48
A50006.8306.80093.916.83
A407.0487.00892.247.05
gpt-oss:20b — 14 GB, 32K context
GPUAvg TTFT (s) ↓P50 TTFT (s)Avg Gen Speed tok/s ↑Avg E2E Time (s) ↓
RTX 50900.6530.619214.903.67
RTX Pro 60000.5560.558202.253.62
RTX Pro 50000.6130.597178.843.98
A60000.6420.638124.665.28
RTX Pro 40000.5530.555117.605.37
A50000.6640.620109.005.85
A400.6460.64596.456.60
RTX Pro 20000.5410.53261.699.24
qwen3.5:9b — 9.9 GB, 32K context
GPUAvg TTFT (s) ↓P50 TTFT (s)Avg Gen Speed tok/s ↑Avg E2E Time (s) ↓
RTX 50900.6180.597140.454.97
RTX Pro 60000.5970.589130.045.16
RTX Pro 50000.5760.579123.135.31
A60000.7470.73280.957.73
A400.7470.73280.957.73
RTX Pro 40000.6000.60378.597.75
A50000.7570.73870.508.60
RTX Pro 20000.7460.74942.1313.45

RTX Pro 5000 (48G Blackwell) achieves 178 tok/s on gpt-oss:20b and 123 tok/s on qwen3.5:9b — approaching RTX 5090 while offering 48 GB vs 32 GB VRAM. Best overall value for single-user LLM hosting when both speed and model capacity matter.

Inference Framework Selection

Framework choice affects throughput and latency as much as GPU selection — pick the right one for your LLM inference use case.

DimensionOllamavLLMSGLangTensorRT-LLM
Design goalLocal single-userProduction high-concurrency APIHigh-throughput structured inferenceMax throughput (NVIDIA only)
Deployment complexitySimplestMediumMediumVery high (requires compilation)
Cold start timeSeconds~62 sec~58 sec~28 min
Single-user TTFT~65 ms~10.7 ms~11–12 ms~10.5 ms
High-concurrency throughputLow (~484 tok/s)HighHigher (+17–29% vs vLLM)Highest
VRAM usageLow (INT4, no pre-alloc)High (pre-allocated)High (pre-allocated)High
FP8/FP4 supportPartialFullFullFull
OpenAI-compatible APIYesYesYesYes
GPU supportNVIDIA + Apple MNVIDIA + AMD + TPUNVIDIA + AMDNVIDIA only

Ollama

Dev / Test

One-command setup, model auto-download. Best for dev, single-user text generation, rapid evaluation.

vLLM

Production First Choice

General production API, widest model compatibility (incl. AMD/TPU). Continuous batching default on.

SGLang

RAG / Agent

RadixAttention prefix caching delivers +29% text generation throughput vs vLLM. Ideal for RAG, multi-turn, DeepSeek-class models.

TensorRT-LLM

Advanced Only

Only if: max throughput on pure NVIDIA stack AND you accept 28-min compilation per model version.

Deployment Optimization Tips

Reduce VRAM (Fix OOM)

  • Enable vLLM continuous batching (default on): dynamically merges requests, lowers peak VRAM
  • Quantize to INT4/INT8 via AWQ/GPTQ: 2–4× VRAM reduction, minimal quality loss
  • Reduce max_tokens and batch_size: cuts peak KV Cache usage
  • In Ollama: set num_ctx explicitly to allocate only what you need
  • Last resort: CPU offloading (latency penalty; only for extreme VRAM shortage)

Improve Throughput & Speed

  • Choose high-bandwidth GPUs: bandwidth caps tok/s. H100 > RTX 5090 > A100 > A6000
  • Use FP8 models + FP8-capable GPU: doubles throughput at same VRAM — also reduces weight memory footprint so more context and concurrency fits
  • Enable Speculative Decoding: small draft model assists large model, reduces TPOT
  • Multi-GPU tensor parallelism: vLLM/SGLang support --tensor-parallel-size

Auxiliary Models for <16 GB GPUs

Model TypeExample ModelTypical VRAMUse Case
EmbeddingQwen3-Embedding-8B~10 GBRAG vector encoding
Rerankerbge-reranker-large~1.7 GBRetrieval result reranking
ASRWhisper / Wav2Vec2–6 GBSpeech-to-text transcription
VLM (Vision, small)MedGemma-4B~4 GBMultimodal perception

Real Customer Deployments

Three detailed case studies + 13 industry deployment records. Data partially anonymized.

Case 1 — AI Application Company: Private Agent Platform

AI software company running Agent systems for code generation, document processing, and automated tasks 24/7. Previously on cloud API token billing — costs unpredictable, rate limits triggered at peak. Local GPU breaks even vs API cost in 1–2 months.

Config: RTX A6000 (48 GB) dedicated · Dual E5-2697v4 CPUs · 256 GB DDR4 · vLLM · Qwen3.5-27B GPTQ-Int4

First-token <500 ms 40–50 tok/s 168h continuous — zero failures $283/mo · 65% savings vs cloud API · ROI 180–430%

Case 2 — Enterprise: Knowledge Base RAG

Large enterprise with millions of documents. Internal Q&A, customer service assist. Data cannot leave the internal network.

Stack: vLLM + gpt-oss-20b (29.6 GB) + Qwen3-Embedding-8B (10.3 GB) + bge-reranker-large (1.7 GB) + Weaviate / FAISS + DIFY workflow.

Avg response 1.5–2.5 sec P99 <3.5 sec 5–8 concurrent threads 50–70% cost reduction · ROI 150–300%

Case 3 — Dev Team: Local Coding Assistant

Enterprise AI R&D team, high-frequency code generation for multiple developers. Previously on Claude/GPT APIs — code data leaving company.

Config: RTX Pro 6000 (96 GB) dedicated · 32-core CPU · 84 GB RAM · 1 Gbps unmetered · vLLM · GLM-4.7-Flash

Avg response 1–2 sec P99 <3 sec All code processed locally 50–70% cost reduction · ROI 150–300%

Additional Industry Deployment Records

Customer TypeCore NeedGPUModel ArchitectureNotes
Medical AI vendorMultimodal clinical note generationRTX A6000Whisper + vision model + LLMMedical-grade privacy
AI medical teamImage + text joint reasoningRTX A6000MedGemma-4B-it multimodalMultimodal medical scene
AI application companyChat + memory + image + image gen AgentRTX A6000LLM + Embedding + VLM + ComfyUIMulti-model collaborative
Financial firmTime-series trading + RL + risk controlRTX A6000Transformer + RL model + FinBERTReal-time, low-latency
Creative / AI teamImage generation workflowRTX 5090ComfyUI + Stable Diffusion multi-modelBlackwell bandwidth advantage
Law firmContract OCR + semantic searchRTX 5090LLM + OCR + EmbeddingDocument privacy
Voice AI teamStable ASR serviceRTX Pro 5000Whisper / Wav2VecLow power, long-running
Enterprise AI teamKnowledge base Q&ARTX Pro 5000Embedding + LLM + RAGKnowledge stays on-prem
Sports data (BeSoccer)Multi-model parallel content genRTX Pro 5000Qwen3-8B-Q4 + Gemma-3-12B Q4Multiple models simultaneously
AI content platformText gen + image gen multi-taskRTX Pro 5000LLM + ComfyUI; Docker multi-containerIsolated deployment
AI application companyDialogue + TTS voice interactionRTX Pro 5000Qwen3.5-35B-AWQ-4bit + CosyVoice TRTGunicorn + Uvicorn
AI R&D teamText + image + voice multimodal hubRTX Pro 5000Qwen3.6-35B + ComfyUI + WhisperDocker multi-container
AI application teamVoice + vision + text multimodalRTX Pro 5000Qwen3.5-27B-VLM-INT4 (vLLM)Voice + vision input

Used by medical AI, fintech, law firms, and 25,000+ GPU server deployments.

SOC-certified U.S. data centers  ·  99.9% SLA  ·  <5 min support response
Get Started

Who This Is (and Isn't) For

Choosing the right GPU for self-hosted LLM workloads on dedicated infrastructure is not right for every team. Here is the honest breakdown.

Good Fit

  • Monthly spend on cloud LLM APIs already exceeds $300 or expected to grow — switching to LLM hosting on a dedicated GPU server or GPU VPS reduces long-term cost significantly
  • Workloads involve patient data, legal contracts, financial reports, or proprietary code — data privacy is non-negotiable
  • Need 24/7 always-on inference without cold-start latency or random resource preemption
  • Building RAG pipelines, AI Agents, or multi-turn applications
  • Have basic Linux ops: SSH access, able to deploy vLLM or Ollama

Not a Good Fit

  • Only need a few hours of GPU time for experiments — pay-per-hour cloud makes more sense (note: Vast.ai uses third-party hosts; documented cases of instances terminated without notice)
  • Need thousand-GPU InfiniBand clusters for distributed hyperscale training — consider Lambda Labs or CoreWeave

Frequently Asked Questions

What is the difference between GPU VPS and a dedicated GPU server?
GPU VPS uses PCIe Passthrough to give you exclusive, non-shared access to physical GPU hardware — near bare-metal performance at lower cost. Suitable for most 7B–70B LLM server workloads. A dedicated GPU server gives you the entire physical machine exclusively: right for production AI server deployments, multi-GPU inference, training, and zero-tolerance performance workloads.
Which inference frameworks come pre-installed, and how do I enable them?
NVIDIA drivers are pre-installed on all plans. At deploy time, select from 20+ pre-configured AI frameworks including Ollama, ComfyUI, Qwen3, and Gemma3. One-click deployment from the control panel under All Products → App.
My LLM inference service is hitting OOM. What can I do without switching hardware?
In priority order: (1) Verify vLLM continuous batching is on (default); (2) quantize to INT4/INT8 via AWQ or GPTQ — 2–4× VRAM reduction; (3) reduce batch_size and max_tokens; (4) in Ollama set num_ctx explicitly; (5) last resort: CPU offloading — significant latency penalty.
How do I calculate the VRAM requirements for my LLM?
LLM GPU requirements for VRAM: Total ≈ model weights + KV Cache + 20–30% headroom. Model weights = param count (B) × precision bytes (FP16=2, INT8=1, INT4=0.5). KV Cache = 2 × layers × KV heads × head dims × precision bytes × context length × concurrency / 1e9 (GB). Example: 7B FP16 → 14 GB; same at INT4 → ~3.5 GB.
Should I use a single large-VRAM GPU or multiple smaller cards for LLM hosting?
Always prefer single large-VRAM where the model fits (A6000 48G, A100 80G, Pro 6000 96G): lowest latency, no inter-GPU communication overhead, simplest deployment. Only use multi-GPU tensor parallelism (--tensor-parallel-size 2+) when the model literally cannot fit on one card.
vLLM vs SGLang — which inference framework should I choose?
Choose vLLM for general production API, widest model compatibility (AMD/TPU/Trainium), mixed workloads. Choose SGLang for RAG, multi-turn dialogue, AI Agents (RadixAttention +29% vs vLLM), DeepSeek-class reasoning models.
How do I design rate limiting and concurrency control for a production LLM API?
Per-user RPM/TPM limits; max 1–3 concurrent requests per user; queue excess requests rather than rejecting; set max input/output tokens. Best practice: API gateway (Kong/Nginx) for auth + rate limiting, vLLM backend for batching and queuing.
What is the GPU hosting pricing model and how do I estimate monthly cost?
Transparent monthly billing with 1/3/12/24-month options. VPS: RTX Pro 2000 (16G) $95.20/mo (20% OFF); RTX Pro 4000 (24G) $159/mo (20% OFF); RTX Pro 5000 (48G) $269/mo; RTX 5090 (32G) $399/mo; RTX Pro 6000 (96G) $479/mo. Dedicated: A6000 $409/mo; A100-40G $360/mo (55% OFF); A100-80G $1,559/mo; H100-80G $2,099/mo. No hidden fees. See gpu-mart.com/pricing.
How do I deploy Hugging Face models or private models on a GPU server?
CUDA and drivers are pre-installed. Private models upload directly from local; HuggingFace models via git clone or huggingface-cli. Launch your self hosted LLM inference service with vLLM/TGI/Ollama (OpenAI-compatible API); expose REST API via port. Most users complete deployment in minutes.
Is OpenAI-compatible API supported? How do I migrate existing code to a self-hosted LLM?
vLLM/SGLang/Ollama all expose OpenAI-format compatible endpoints on your GPU server. Migration requires only changing base_url and api_key — no business logic changes. Most teams complete the switch in under 5 minutes.
Can I upgrade my GPU VPS or dedicated server configuration later?
You can upgrade to a higher GPU VPS tier or dedicated server at any time. Adding extra GPUs to the same server after deployment is not supported — select a multi-GPU plan at initial deployment.
Is there a free trial available for GPU hosting?
GPU Mart offers hourly pay-as-you-go billing for quick testing. A 24-hour free trial is also available — verify your actual workload before committing to a monthly plan.
Can I self-host an LLM on a GPU VPS?
Yes. GPU Mart LLM VPS hosting uses KVM PCI GPU Passthrough, giving you exclusive access to physical GPU hardware — the same GPU performance as a dedicated server at a lower price. You get root access to install vLLM, Ollama, or any inference framework, and can expose an OpenAI-compatible API endpoint. VPS plans start from $95/mo with 16 GB VRAM, suitable for running 7B–14B parameter models. For larger models (27B–70B+), 48–96 GB VRAM VPS options are available from $269/mo.
How much VRAM is required to self-host a 70B LLM?
VRAM requirements for a 70B model depend on quantization: FP16 (full precision) requires ~140 GB — beyond a single card; INT8 quantization requires ~70 GB, fitting on an A100-80G or H100-80G; INT4/AWQ/GPTQ quantization reduces this to ~35 GB, runnable on an RTX Pro 5000-48G or A6000-48G with two cards. For most production use cases, INT4 quantization of a 70B model on a single 48 GB GPU delivers good quality with practical latency. If quality loss from quantization is unacceptable, use two 48 GB GPUs with tensor parallelism.
What is the best GPU for self-hosted LLMs?
The best GPU for self-hosted LLM workloads depends on model size and budget. For 7B–14B models: RTX Pro 4000 (24G, $159/mo) or RTX 5090 (32G, $399/mo) for speed. For 27B–35B models: RTX Pro 5000 (48G, $269/mo) is the best value — 48 GB VRAM, Blackwell FP8/FP4 native, 178 tok/s on Ollama. For 70B+ models: A100-80G ($1,559/mo) or H100-80G ($2,099/mo). For maximum price-to-performance on FP8 models: RTX 5090 hits 144 tok/s on 8B-FP8 at $399/mo, comparable to H100 at 1/5 the cost.
Is self-hosting an LLM cheaper than using a cloud API?
For high-volume use cases, yes — significantly. A team calling GPT-4o at $5/M input + $15/M output tokens, running 2M tokens/day, spends ~$20/day or $600/month. A GPU VPS at $269/mo (RTX Pro 5000, 48G) handles the same workload with no per-token cost. Break-even is typically 1–2 months, with 50–70% long-term savings. For very low-volume usage (under 500K tokens/day), API pricing may still be more economical than maintaining dedicated GPU infrastructure.
What GPU do I need to run a 32B or 70B parameter LLM?
For a 32B model: INT4 quantization (AWQ/GPTQ) requires ~16–18 GB VRAM — fits on an RTX Pro 4000 (24G) or RTX 5090 (32G). FP16 requires ~64 GB — needs an A100-80G or two 48G cards. For a 70B model: INT4 quantization requires ~35–40 GB — fits on an RTX Pro 5000 (48G) or A6000 (48G). FP16 requires ~140 GB — requires two A100-80G or H100-80G cards with tensor parallelism. Recommended for most teams: run 32B at INT4 on RTX Pro 5000 ($269/mo) or 70B at INT4 on two A6000s ($818/mo combined).

Not sure which GPU fits your model? Talk to an expert.

Free 24-hour trial  ·  Flat-rate billing  ·  AI training server and inference server configs available
Get Free Expert Consultation

Start Your Self Hosted LLM Deployment Today

GPU Mart — Dedicated GPU Hosting for Workloads That Never Stop.

Choose the GPU configuration that fits your model Transparent monthly billing, no hidden fees Free 24-hour trial & expert selection guidance