(Post-WIP) Comparing LLM inference engines (multi-user and multi-model)

  • People
    • David
  • Idea
    • For a given set of local LLM models that a set of users (or community) wants to use, what is the best/easiest way to serve them (based on ease of use, speed of processing and generation, and features like multi-context and multi-model support)?
  • Details
    • Engines like llama.cpp-server, LM Studio, TabbyAPI, Ollama, etc. have different tradeoffs (ease of use, speed, memory use, multi-user features, etc.)
    • For a deployment of a local LLM like the one used for the hackathon (a small Gemma model), what deployment setup makes the most sense for a group of users to serve one or multiple models and handle concurrent requests as performantly as possible? (See the benchmark sketch below.)
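
One way to start comparing these engines on the same footing is to hit them through the OpenAI-compatible /v1/chat/completions endpoint that most of them expose, and measure latency and token throughput under simulated concurrent users. Below is a minimal sketch using only the Python standard library; BASE_URL, MODEL, and the request counts are placeholders/assumptions that would need adjusting per engine (e.g. llama.cpp-server defaults to port 8080, Ollama to 11434).

```python
# Minimal concurrency benchmark sketch.
# Assumptions: an OpenAI-compatible endpoint at BASE_URL; MODEL is a placeholder
# id -- substitute whatever the engine reports under /v1/models.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint
MODEL = "gemma-3-4b-it"                                  # placeholder model id
PROMPT = "Summarise the rules of chess in three sentences."
N_USERS = 8           # simulated concurrent users
REQUESTS_PER_USER = 4

def one_request(_):
    """Send a single chat completion and return (latency_s, completion_tokens)."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as resp:
        data = json.load(resp)
    latency = time.perf_counter() - start
    tokens = data.get("usage", {}).get("completion_tokens", 0)
    return latency, tokens

if __name__ == "__main__":
    total = N_USERS * REQUESTS_PER_USER
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=N_USERS) as pool:
        results = list(pool.map(one_request, range(total)))
    wall = time.perf_counter() - wall_start

    latencies = sorted(r[0] for r in results)
    tokens = sum(r[1] for r in results)
    print(f"{total} requests, {N_USERS} concurrent users")
    print(f"wall time:       {wall:.1f} s")
    print(f"median latency:  {latencies[len(latencies) // 2]:.2f} s")
    print(f"p95 latency:     {latencies[int(len(latencies) * 0.95) - 1]:.2f} s")
    print(f"throughput:      {tokens / wall:.1f} completion tokens/s")
```

Running the same script against each engine with the same model and quantization gives a rough apples-to-apples view of throughput and tail latency under concurrent load.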
