(Post-WIP) Comparing LLM inference engines (multi-user and multi-model)
- People
- David
- Idea
- For a given set of local LLM models that a set of users (or a community) wants to use - what is the best/easiest way to serve them, judged on ease of use, processing and generation speed, and features like multi-context and concurrent multi-user request handling?
- Details
- Engines like llama.cpp-server, LM Studio, TabbyAPI, Ollama, etc. have different tradeoffs (ease of use, speed, memory use, multi-user features, etc.)
- For a deployment of a local LLM like the hackathon's (a small Gemma model) - what deployment setup makes the most sense for a group of users to serve a single model or multiple models and handle concurrent requests as performantly as possible? (See the benchmark sketch below.)
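- Since most of these engines expose an OpenAI-compatible HTTP API, one way to compare them is a small concurrency benchmark that fires N simultaneous chat requests and records latency and aggregate tokens/sec. The sketch below is only illustrative: the base URL, port, model id ("gemma-2-2b-it"), and the presence of a `usage` block in the response are assumptions that would need adjusting per engine.

```python
# Minimal concurrency benchmark sketch for an OpenAI-compatible endpoint.
# Assumptions (not from the original note): the engine is reachable at
# BASE_URL, serves /v1/chat/completions, returns a `usage` block, and the
# model id "gemma-2-2b-it" matches whatever model the engine has loaded.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8080/v1"  # placeholder; adjust per engine
MODEL = "gemma-2-2b-it"                # placeholder model id
PROMPT = "Summarize the tradeoffs of local LLM serving in two sentences."
CONCURRENCY = 8                        # simulated number of simultaneous users


async def one_request(client: httpx.AsyncClient) -> tuple[float, int]:
    """Send one chat completion; return (latency_seconds, completion_tokens)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = await client.post(f"{BASE_URL}/chat/completions", json=payload)
    resp.raise_for_status()
    latency = time.perf_counter() - start
    usage = resp.json().get("usage", {})
    return latency, usage.get("completion_tokens", 0)


async def main() -> None:
    async with httpx.AsyncClient(timeout=120.0) as client:
        wall_start = time.perf_counter()
        results = await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))
        )
        wall = time.perf_counter() - wall_start

    latencies = [latency for latency, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    print(
        f"concurrency={CONCURRENCY} wall={wall:.1f}s "
        f"mean_latency={sum(latencies) / len(latencies):.1f}s "
        f"aggregate_throughput={total_tokens / wall:.1f} tok/s"
    )


if __name__ == "__main__":
    asyncio.run(main())
```

- Running the same script at several concurrency levels against each engine (same model, same quantization, same hardware) would give a rough apples-to-apples view of how batching and multi-user handling differ, which is the core of the comparison proposed above.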