Two paths to vLLM on Apple Silicon - vllm-metal vs vllm-mlx
People
David
Idea
I tested both vllm-metal and vllm-mlx on the M3 Ultra Mac Studio to see which one gives better multi-user LLM throughput on a high-memory Apple Silicon machine.
Details
- Both projects showed up in late 2025 - vLLM on Metal/MLX is very exciting because it should let us handle multi-user concurrency better than LM-Studio while still getting the benefit of MLX (much faster prompt processing than GGUF)
- vllm-metal is the official community plugin under the vllm-project GitHub org, and vllm-mlx is an independent reimplementation with a paper accepted at EuroMLSys '26
- Both use Apple's MLX framework under the hood - the "Metal" in vllm-metal refers to the GPU API that MLX compiles down to, not hand-written Metal shaders
- vllm-metal requires building vLLM v0.13.0 from source plus a Rust toolchain (though an easy shell-script installer handles this), while vllm-mlx installs with a simple pip install vllm-mlx
- vllm-mlx publishes real benchmarks (525 tok/s on Qwen3-0.6B 4-bit, 21-87% faster than llama.cpp on an M4 Max) while vllm-metal has published none so far
- vllm-mlx supports multimodal workloads out of the box - text, vision, audio, and embeddings - while vllm-metal is text-only for now
- vllm-mlx exposes OpenAI-compatible, Anthropic Messages API, and MCP endpoints, though we don't really need the extra API surfaces because we already route everything through LiteLLM (a request sketch is included after this list)
- vllm-metal works for me on small basic models and is promising given the first-party organizational backing from the vLLM project
- vllm-mlx gave us immediate results and value - we are starting to experiment on it with the new Minimax M2.1 MLX 5-bit model (from a PR branch)
- Both projects are sub-v1.0 and experimental
- For some performance numbers: in my tests with the Minimax M2.1 MLX 5-bit model, vllm-mlx delivered about 50% higher overall generation throughput than LM-Studio's single-threaded operation (and the MLX format gives us roughly a 5x increase in prompt-processing speed compared to GGUF/llama.cpp) - a rough measurement sketch follows below
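
A minimal sketch of hitting the OpenAI-compatible endpoint (including a vision request, since vllm-mlx advertises multimodal support), using the standard openai Python client. The base_url, port, and model id are assumptions for illustration, not values from the projects' docs - substitute whatever the local server actually reports.

```python
# Sketch: text + image request against an assumed local vllm-mlx OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address/port
    api_key="not-needed-locally",         # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="mlx-community/Qwen2.5-VL-7B-Instruct-4bit",  # hypothetical multimodal model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI chat-completions format, it can also be registered as a backend in LiteLLM without any custom adapter.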
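And a rough way to sanity-check the multi-user throughput claim: fire several concurrent requests and compare aggregate generated tokens per second against a single sequential run. This is only a sketch under assumptions (placeholder endpoint, model id, prompts), not the exact methodology behind the numbers above.

```python
# Sketch: aggregate tok/s across N concurrent requests to the assumed local server.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
MODEL = "minimax-m2.1-mlx-5bit"  # placeholder id; use whatever the server registers
CONCURRENCY = 8                  # simulated number of simultaneous users


async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    # completion_tokens counts only generated tokens, which is what throughput compares
    return resp.usage.completion_tokens


async def main() -> None:
    prompts = [f"Write a short paragraph about topic number {i}." for i in range(CONCURRENCY)]
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    print(f"{total} tokens across {CONCURRENCY} requests in {elapsed:.1f}s "
          f"-> {total / elapsed:.1f} tok/s aggregate")


asyncio.run(main())
```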