<a name="top"></a>
<!-- <h1 align="center"> mistral.rs </h1> -->
<div align="center">
<img src="https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/res/banner.png" alt="mistral.rs" width="100%" style="max-width: 800px;">
</div>
<h3 align="center">
Fast, flexible LLM inference.
</h3>
<p align="center">
| <a href="https://ericlbuehler.github.io/mistral.rs/"><b>Documentation</b></a> | <a href="https://crates.io/crates/mistralrs"><b>Rust SDK</b></a> | <a href="https://ericlbuehler.github.io/mistral.rs/PYTHON_SDK.html"><b>Python SDK</b></a> | <a href="https://discord.gg/SZrecqK8qw"><b>Discord</b></a> |
</p>
<p align="center">
<a href="https://github.com/EricLBuehler/mistral.rs/stargazers">
<img src="https://img.shields.io/github/stars/EricLBuehler/mistral.rs?style=social&label=Star" alt="GitHub stars">
</a>
</p>

## Why mistral.rs?
- Any HuggingFace model, zero config: Just `mistralrs run -m user/model`. Auto-detects architecture, quantization, chat template.
- True multimodality: Vision, audio, speech generation, image generation, embeddings.
- Not another model registry: Use HuggingFace models directly. No converting, no uploading to a separate service.
- Full quantization control: Choose the precise quantization you want to use, or make your own UQFF with `mistralrs quantize`.
- Built-in web UI: `mistralrs serve --ui` gives you a web interface instantly.
- Hardware-aware: `mistralrs tune` benchmarks your system and picks optimal quantization + device mapping.
- Flexible SDKs: Python package and Rust crate to build your projects.
## Quick Start

### Install

**Linux/macOS:**

```bash
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
```

**Windows (PowerShell):**

```powershell
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
```
Manual installation & other platforms
### Run Your First Model

```bash
# Interactive chat
mistralrs run -m Qwen/Qwen3-4B

# Or start a server with web UI
mistralrs serve --ui -m google/gemma-3-4b-it
```

Then visit http://localhost:1234/ui for the web chat interface.
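The server also speaks an OpenAI-compatible HTTP API, so any OpenAI client can talk to it. A minimal sketch using the `openai` Python package, assuming the default port 1234 from above, the standard `/v1` route prefix, and a placeholder API key (none is assumed to be enforced locally):

```python
# Minimal sketch: chat with the local mistralrs server through its
# OpenAI-compatible API. Assumes the server started above is listening
# on port 1234 and exposes the standard /v1 routes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="default",  # mirrors the model name used in the SDK examples below
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```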
## The mistralrs CLI

The CLI is designed to be zero-config: just point it at a model and go.

- Auto-detection: Automatically detects model architecture, quantization format, and chat template
- All-in-one: Single binary for chat, server, benchmarks, and web UI (`run`, `serve`, `bench`)
- Hardware tuning: Run `mistralrs tune` to automatically benchmark and configure optimal settings for your hardware
- Format-agnostic: Works with Hugging Face models, GGUF files, and UQFF quantizations seamlessly
```bash
# Auto-tune for your hardware and emit a config file
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml

# Run using the generated config
mistralrs from-config -f config.toml

# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
mistralrs doctor
```
<details open>
<summary><b>Web Chat Demo</b></summary>
<br>
<img src="https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/res/chat.gif" alt="Web Chat UI Demo" />
</details>
## What Makes It Fast

### Performance

- Continuous batching, enabled by default on all devices
- CUDA with FlashAttention V2/V3, Metal, and multi-GPU tensor parallelism
- PagedAttention for high-throughput continuous batching on CUDA and Apple Silicon, plus prefix caching (including multimodal)
### Quantization (full docs)

- In-situ quantization (ISQ) of any Hugging Face model
- GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB support (see the sketch after this list)
- ⭐ Per-layer topology: Fine-tune quantization per layer for optimal quality/speed
- ⭐ Auto-select the fastest quant method for your hardware
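As a rough illustration of the GGUF path, here is a hedged Python SDK sketch; the `Which.GGUF` field names and the specific repository/filename are assumptions drawn from older examples and may differ in current releases:

```python
# Sketch only: load a pre-quantized GGUF file through the Python SDK.
# The Which.GGUF fields and the example repo/filename are assumptions.
from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",            # tokenizer/chat template source
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",  # GGUF repository
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",    # 4-bit GGUF file
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=128,
    )
)
print(res.choices[0].message.content)
```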
### Flexibility
- LoRA & X-LoRA with weight merging
- AnyMoE: Create mixture-of-experts on any base model
- Multiple models: Load/unload at runtime
### Agentic Features

- Integrated tool calling with Python/Rust callbacks (see the sketch after this list)
- ⭐ Web search integration
- ⭐ MCP client: Connect to external tools automatically
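Because the server is OpenAI-compatible, tool calling can also be exercised from any OpenAI client. A minimal sketch, assuming the Quick Start server on port 1234 accepts the standard `tools` field; `get_weather` is a hypothetical function used only for illustration:

```python
# Sketch: request a tool call through the OpenAI-compatible endpoint.
# Assumes the Quick Start server is running on port 1234; get_weather
# is a hypothetical tool, not something mistral.rs provides.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, the structured call appears here.
print(resp.choices[0].message.tool_calls)
```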
## Supported Models

<details>
<summary><b>Text Models</b></summary>

- Granite 4.0
- SmolLM 3
- DeepSeek V3
- GPT-OSS
- DeepSeek V2
- Qwen 3 Next
- Qwen 3 MoE
- Phi 3.5 MoE
- Qwen 3
- GLM 4
- GLM-4.7-Flash
- GLM-4.7 (MoE)
- Gemma 2
- Qwen 2
- Starcoder 2
- Phi 3
- Mixtral
- Phi 2
- Gemma
- Llama
- Mistral

</details>

<details>
<summary><b>Vision Models</b></summary>

- Qwen 3-VL
- Qwen 3-VL MoE
- Gemma 3n
- Llama 4
- Gemma 3
- Mistral 3
- Phi 4 multimodal
- Qwen 2.5-VL
- MiniCPM-O
- Llama 3.2 Vision
- Qwen 2-VL
- Idefics 3
- Idefics 2
- LLaVA Next
- LLaVA
- Phi 3V

</details>

<details>
<summary><b>Speech Models</b></summary>

- Voxtral (ASR/speech-to-text)
- Dia

</details>

<details>
<summary><b>Image Generation Models</b></summary>

- FLUX

</details>

<details>
<summary><b>Embedding Models</b></summary>

- Embedding Gemma
- Qwen 3 Embedding

</details>
Request a new model | Full compatibility tables
## Python SDK

```bash
pip install mistralrs  # or mistralrs-cuda, mistralrs-metal, mistralrs-mkl, mistralrs-accelerate
```
```python
from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)
```
Python SDK | Installation | Examples | Cookbook
## Rust SDK

```bash
cargo add mistralrs
```
```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = VisionModelBuilder::new("google/gemma-3-4b-it")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Hello!",
    );

    let response = model.send_chat_request(messages).await?;
    println!("{:?}", response.choices[0].message.content);
    Ok(())
}
```
## Docker

For quick containerized deployment:

```bash
docker pull ghcr.io/ericlbuehler/mistral.rs:latest
docker run --gpus all -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
  serve -m Qwen/Qwen3-4B
```
For production use, we recommend installing the CLI directly for maximum flexibility.
## Documentation

For complete documentation, see the documentation site.

Quick Links:

- CLI Reference - All commands and options
- HTTP API - OpenAI-compatible endpoints
- Quantization - ISQ, GGUF, GPTQ, and more
- Device Mapping - Multi-GPU and CPU offloading
- MCP Integration - Connect to external MCP tool servers
- Troubleshooting - Common issues and solutions
- Configuration - Environment variables for configuration
## Contributing
Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.
## Credits
This project would not be possible without the excellent work at Candle. Thank you to all contributors!
mistral.rs is not affiliated with Mistral AI.
<p align="right"> <a href="#top">Back to Top</a> </p>