(Post-WIP) Comparing LLM inference engines (multi-user and multi-model)
- People
- David
- Idea
- For a given set of local LLM models that a set of users (or a community) wants to use - what is the best/easiest way to serve them, judged on ease of use, processing and generation speed, and features like multi-context and concurrent multi-user request handling?
- Details
- Engines like llama.cpp-server, LM Studio, TabbyAPI, Ollama, etc. have different tradeoffs (ease of use, speed, memory use, multi-user features, etc.)
- For a deployment of a local LLM like the hackathon's (a small Gemma model) - what deployment setup makes the most sense for a group of users to serve a single model or multiple models and handle concurrent requests as performantly as possible? (See the benchmark sketch below.)
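- Since most of these engines expose an OpenAI-compatible HTTP API, one way to compare them is a small concurrency benchmark that fires N simultaneous chat requests and records latency and aggregate tokens/sec. The sketch below is only illustrative: the base URL, port, model id ("gemma-2-2b-it"), and the presence of a `usage` block in the response are assumptions that would need adjusting per engine.

```python
# Minimal concurrency benchmark sketch for an OpenAI-compatible endpoint.
# Assumptions (not from the original note): the engine is reachable at
# BASE_URL, serves /v1/chat/completions, returns a `usage` block, and the
# model id "gemma-2-2b-it" matches whatever model the engine has loaded.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8080/v1"  # placeholder; adjust per engine
MODEL = "gemma-2-2b-it"                # placeholder model id
PROMPT = "Summarize the tradeoffs of local LLM serving in two sentences."
CONCURRENCY = 8                        # simulated number of simultaneous users


async def one_request(client: httpx.AsyncClient) -> tuple[float, int]:
    """Send one chat completion; return (latency_seconds, completion_tokens)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = await client.post(f"{BASE_URL}/chat/completions", json=payload)
    resp.raise_for_status()
    latency = time.perf_counter() - start
    usage = resp.json().get("usage", {})
    return latency, usage.get("completion_tokens", 0)


async def main() -> None:
    async with httpx.AsyncClient(timeout=120.0) as client:
        wall_start = time.perf_counter()
        results = await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))
        )
        wall = time.perf_counter() - wall_start

    latencies = [latency for latency, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    print(
        f"concurrency={CONCURRENCY} wall={wall:.1f}s "
        f"mean_latency={sum(latencies) / len(latencies):.1f}s "
        f"aggregate_throughput={total_tokens / wall:.1f} tok/s"
    )


if __name__ == "__main__":
    asyncio.run(main())
```

- Running the same script at several concurrency levels against each engine (same model, same quantization, same hardware) would give a rough apples-to-apples view of how batching and multi-user handling differ, which is the core of the comparison proposed above.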