From Benchmarks to Builders: Running MiniMax M2.1 on Our Mac Studio
Last week I wrote about two paths to vLLM on Apple Silicon - comparing vllm-metal and vllm-mlx as options for local inference. This week the picture changed. LM Studio shipped concurrent request support, and suddenly the simplest option became the most practical one.
People
David
Idea
We spent the week stress-testing local LLM inference on our Mac Studio M3 Ultra (512GB) using Harbor's terminal-bench benchmark - trying to find the right configuration to serve MiniMax M2.1 to our builders for agentic coding work on our own hardware.
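In practice, "serving MiniMax M2.1 to our builders" just means pointing any OpenAI-compatible client at the Mac Studio instead of a hosted API. Here's a minimal sketch, assuming LM Studio's local server on its default port and a placeholder model id for however the loaded quant is named:

```python
# Minimal sketch: talking to LM Studio's OpenAI-compatible local server.
# The model id below is a placeholder - use whatever name LM Studio shows
# for the loaded MiniMax M2.1 quant.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",                  # any non-empty string works locally
)

resp = client.chat.completions.create(
    model="minimax-m2.1-5bit-mlx",  # placeholder model id
    messages=[{"role": "user", "content": "Write a shell one-liner that counts lines of Python in this repo."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```

Anything that already speaks the OpenAI API - agents, editors, scripts - can be switched over by changing the base URL.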
Details
- LM Studio 0.4 added concurrent inference, which was the missing piece - prior to this, every request through LM Studio had to wait in line
- We ran 25 test configurations across vllm-mlx, LM Studio (MLX and GGUF), and vLLM on a Blackwell GPU, tracking memory pressure, timeouts, accuracy, and server crashes
- vllm-mlx is fast early on but degrades badly as conversation context grows - it crashed the Metal GPU with an unrecoverable OOM error at 4 concurrent tasks (note: this was a special branch for MiniMax; the main line may have fixes now)
- LM Studio never crashed once across all our tests, even when we pushed it to 6 concurrent conversations and 2.3 million tokens over an hour
- The concurrency sweet spot on 512GB is 3-4 simultaneous tasks (see the concurrency smoke-test sketch after this list) - going to 6 caused swap thrashing, and going to 5 with vllm-mlx froze the whole system (at least without quantizing the KV cache)
- MLX 5-bit and GGUF Q5_K_XL quants performed surprisingly similarly on LM Studio - about 26 vs 30 minutes for the same three tasks, both completing cleanly (we would expect Q5_K_XL to have higher accuracy, though)
- On our 18-task accuracy baseline, MiniMax M2.1 5-bit solved 11 tasks (61%), including code generation, optimization, and security challenges
- 8-bit KV cache quantization cut memory use but added a per-call slowdown that wiped out any concurrency gains - not worth it at current settings (we also noticed an accuracy drop)
- Keep "separate reasoning from content" turned OFF in LM Studio - turning it on caused a 10x slowdown and 5 of 10 tasks to time out in our tests (this may just be a weird interaction with the Terminus 2.0 reference agent from Harbor)
- Our projection for a full 89-task run on LM Studio at -n 4 concurrency (5-bit MLX, without quantizing the KV cache) is roughly 18-20 passes at about 20% accuracy - a 30% relative drop from the full-precision API model results, which is a reasonable trade for running everything on-prem on a single Mac (rough math sketched below)
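For the concurrency point above, this is the kind of smoke test we mean - not the terminal-bench harness itself, just a sketch assuming the same local endpoint and placeholder model id. If the server is genuinely concurrent, wall time for four parallel requests should land near the slowest single request, not near the sum of all four:

```python
# Rough concurrency smoke test against LM Studio's local server.
# Fires N simultaneous chat completions and compares wall time to
# per-request latencies - the behavior LM Studio 0.4 added is that
# these requests interleave instead of queuing one behind another.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="minimax-m2.1-5bit-mlx",  # placeholder model id
        messages=[{"role": "user", "content": f"Task {i}: explain what a KV cache is in two sentences."}],
        max_tokens=200,
    )
    return time.perf_counter() - start

async def main(n: int = 4) -> None:
    wall_start = time.perf_counter()
    latencies = await asyncio.gather(*(one_request(i) for i in range(n)))
    wall = time.perf_counter() - wall_start
    per_request = ", ".join(f"{t:.1f}s" for t in latencies)
    print(f"n={n}  per-request: {per_request}  wall: {wall:.1f}s")

asyncio.run(main())
```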
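And the back-of-envelope math behind the last bullet, with the caveat that the implied API baseline is derived from the 30% relative-drop figure rather than measured here:

```python
# Reading of the 89-task projection: 18-20 passes -> ~20% local accuracy,
# and undoing a ~30% relative drop gives the implied full-precision API baseline.
total_tasks = 89

for passes in range(18, 21):
    local_acc = passes / total_tasks
    implied_api_acc = local_acc / (1 - 0.30)
    print(f"{passes} passes -> local {local_acc:.1%}, implied API baseline ~{implied_api_acc:.1%}")
```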