Two paths to vLLM on Apple Silicon - vllm-metal vs vllm-mlx
People
David
Idea
I tested both vllm-metal and vllm-mlx on the M3 Ultra Mac Studio to see which one gives better multi-user LLM throughput on a high-memory Apple Silicon machine.
Details
- Both projects showed up in late 2025 - vLLM on Metal/MLX is very exciting because it should let us handle multi-user concurrency better than LM-Studio while still getting the benefit of MLX (much faster prompt processing than GGUF)
- vllm-metal is the official community plugin under the vllm-project GitHub org, and vllm-mlx is an independent reimplementation with a paper accepted at EuroMLSys '26
- Both use Apple's MLX framework under the hood - the "Metal" in vllm-metal refers to the GPU API that MLX compiles down to, not hand-written Metal shaders
- vllm-metal requires building vLLM v0.13.0 from source plus a Rust toolchain (though an easy shell-script installer handles this), while vllm-mlx installs with a simple pip install vllm-mlx
- vllm-mlx publishes real benchmarks (525 tok/s on Qwen3-0.6B 4-bit, 21-87% faster than llama.cpp on an M4 Max) while vllm-metal has published none so far
- vllm-mlx supports multimodal workloads out of the box - text, vision, audio, and embeddings - while vllm-metal is text-only for now
- vllm-mlx exposes OpenAI-compatible, Anthropic Messages API, and MCP endpoints, though we don't really need the extra API surfaces because we already route everything through LiteLLM (a request sketch is included after this list)
- vllm-metal works for me on small basic models and is promising given the first-party organizational backing from the vLLM project
- vllm-mlx gave us immediate results and value - we are starting to experiment on it with the new Minimax M2.1 MLX 5-bit model (from a PR branch)
- Both projects are sub-v1.0 and experimental
- For some performance numbers: in my tests with the Minimax M2.1 MLX 5-bit model, vllm-mlx delivered about 50% higher overall generation throughput than LM-Studio's single-threaded operation (and the MLX format gives us roughly a 5x increase in prompt-processing speed compared to GGUF/llama.cpp) - a rough measurement sketch follows below
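
A minimal sketch of hitting the OpenAI-compatible endpoint (including a vision request, since vllm-mlx advertises multimodal support), using the standard openai Python client. The base_url, port, and model id are assumptions for illustration, not values from the projects' docs - substitute whatever the local server actually reports.

```python
# Sketch: text + image request against an assumed local vllm-mlx OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address/port
    api_key="not-needed-locally",         # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="mlx-community/Qwen2.5-VL-7B-Instruct-4bit",  # hypothetical multimodal model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI chat-completions format, it can also be registered as a backend in LiteLLM without any custom adapter.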
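And a rough way to sanity-check the multi-user throughput claim: fire several concurrent requests and compare aggregate generated tokens per second against a single sequential run. This is only a sketch under assumptions (placeholder endpoint, model id, prompts), not the exact methodology behind the numbers above.

```python
# Sketch: aggregate tok/s across N concurrent requests to the assumed local server.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
MODEL = "minimax-m2.1-mlx-5bit"  # placeholder id; use whatever the server registers
CONCURRENCY = 8                  # simulated number of simultaneous users


async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    # completion_tokens counts only generated tokens, which is what throughput compares
    return resp.usage.completion_tokens


async def main() -> None:
    prompts = [f"Write a short paragraph about topic number {i}." for i in range(CONCURRENCY)]
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    print(f"{total} tokens across {CONCURRENCY} requests in {elapsed:.1f}s "
          f"-> {total / elapsed:.1f} tok/s aggregate")


asyncio.run(main())
```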