Benchmarking RunPod cloud GPUs
People
David
Idea
Exploring the performance, cost, and usability of cloud GPUs for running open-source AI models, compared with local hardware and building on recent learning about serving software that handles concurrent users
Details
- Compared RunPod serverless endpoints and dedicated pods, using vLLM on Nvidia GPUs and SGLang on AMD GPUs
- Benchmarked Nvidia 3090, 4090, RTX 6000 Ada, and AMD MI300X GPUs with the Ray project's LLMPerf at an 8k context size (a client-side sketch of this kind of load test follows the list)
- Tested models: Qwen3-235B-A22B-FP8 and Qwen3-30B-A3B-FP8, both with 40k context windows
- Serverless GPUs had extremely long queues, making them unreliable
- Dedicated pods had much better reliability
- A 2x Nvidia 4090 pod ($0.69/hr per GPU) struggled with throughput (see the tensor-parallel sketch below)
- RTX 6000 Ada pod ($0.77/hr) showed solid throughput with Qwen3-30B at moderate concurrency
- Single AMD MI300X GPU pod ($2.49/hr) had outstanding throughput even at high concurrency
- A dual AMD MI300X pod handled the larger Qwen3-235B efficiently, though initial model load times were lengthy
- Persistent model caching/storage for pods can cost around $20/month extra but avoids re-downloading weights on every cold start (rough break-even arithmetic below)
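To make the setup easier to reproduce, here is a minimal sketch of the kind of client-side load test LLMPerf performs. It assumes a vLLM or SGLang server is already running on the pod with its OpenAI-compatible API enabled; the endpoint URL, model name, prompt set, and concurrency levels are placeholders, not the exact benchmark configuration used.

```python
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # pip install openai

# Both vLLM and SGLang expose an OpenAI-compatible HTTP API.
# Endpoint and model name are placeholders for the pod's actual values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen3-30B-A3B-FP8"

def one_request(prompt: str) -> tuple[float, int]:
    """Send one chat completion; return (latency_s, completion_tokens)."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        temperature=0.0,
    )
    return time.perf_counter() - start, resp.usage.completion_tokens

def run_at_concurrency(concurrency: int, total_requests: int = 32) -> None:
    prompts = [f"Summarize request {i} in a few sentences." for i in range(total_requests)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, prompts))
    elapsed = time.perf_counter() - start
    tokens = sum(n for _, n in results)
    mean_latency = sum(t for t, _ in results) / len(results)
    print(f"concurrency={concurrency:3d}  "
          f"output tok/s={tokens / elapsed:7.1f}  "
          f"mean latency={mean_latency:5.2f}s")

# Sweep concurrency levels, as in the pod comparisons above.
for c in (1, 8, 32):
    run_at_concurrency(c)
```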
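For the two-GPU Nvidia pod, the model has to be sharded across both cards. Below is a rough sketch using vLLM's offline Python API with tensor parallelism; the model name, context length, and memory settings are illustrative assumptions, and the actual pod may have used different launch flags (or vLLM's HTTP server instead).

```python
from vllm import LLM, SamplingParams

# Shard the model across both GPUs in the pod via tensor parallelism.
# Model, parallelism degree, and context length are illustrative.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B-FP8",
    tensor_parallel_size=2,       # e.g. the 2x 4090 pod
    max_model_len=40960,          # ~40k context, matching the models tested
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```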
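The storage trade-off comes down to simple arithmetic: persistent storage is worth it once re-downloads burn more paid GPU time than the volume costs. The sketch below uses the dual-MI300X pod rate from above; the per-GB storage rate, model size, and download speed are assumed placeholder values, not quoted RunPod prices.

```python
# Rough break-even for persistent model storage vs. re-downloading.
# Rates below are placeholder assumptions, not quoted RunPod prices.
model_size_gb = 230       # e.g. Qwen3-235B-A22B-FP8 weights, roughly
storage_rate = 0.07       # assumed $/GB/month for a network volume
gpu_rate = 4.98           # $/hr for a dual MI300X pod (2 x $2.49)
download_gb_per_s = 0.5   # assumed effective download speed

storage_cost = model_size_gb * storage_rate                  # $/month
download_hours = model_size_gb / download_gb_per_s / 3600    # hours per cold start
redownload_cost = download_hours * gpu_rate                  # paid GPU idle time

print(f"storage: ${storage_cost:.2f}/month")
print(f"each re-download burns ~${redownload_cost:.2f} of GPU time "
      f"({download_hours * 60:.0f} min)")
print(f"break-even: ~{storage_cost / redownload_cost:.1f} cold starts/month")
```

Under these assumed numbers, storage pays for itself after a few dozen cold starts a month, and the slow initial load of the 235B model noted above makes the avoided wait worth as much as the dollars.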