Run locally

Open models under 25GB you can run on your own hardware — with Ollama & Docker commands.

catalog updated 25m ago
Everything here runs on your own hardware. "Under 25 GB" means it fits a single 24 GB consumer GPU (RTX 3090/4090) — or a laptop with enough RAM on CPU. Pick by benchmark, copy the Ollama or Docker command, and run it. Sizes assume Q4_K_M quantization for LLMs.
303 models
🧪 Local sandbox checking…
Spin up a temporary Ollama service via Docker and test any model right here — the output runs on your hardware, no API keys needed. Click “Test locally” on any card, or type a model tag.
◎ Reasoning & chat 80
Qwen3 0.6B
0.5 GB
Alibaba (Qwen Team) · 0.6B · Apache-2.0
📊 Small-scale; hybrid think/non-think; punches above size on reasoning
<1GB VRAM, runs on CPU / edge
ollamaollama run qwen3:0.6b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:0.6b
Hugging Face ↗ · ollama
Llama 3.2 1B Instruct
0.8 GB
Meta · 1.23B · Llama 3.2 Community License
📊 MMLU 49.3, IFEval 59.5, GSM8K 44.4
~1GB VRAM, runs easily on CPU / phones
ollamaollama run llama3.2:1b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.2:1b
Hugging Face ↗ · ollama
Gemma 3 1B Instruct
0.8 GB
Google DeepMind · 1B · Gemma Terms of Use
📊 Text-only; solid small-model chat
<1GB VRAM, runs on CPU / mobile
ollamaollama run gemma3:1b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:1b
Hugging Face ↗ · ollama
Falcon3-1B-Instruct
1 GB
TII (UAE) · 1.7B · TII Falcon-LLM License 2.0
📊 Capable tiny model for size
<1GB VRAM, runs on CPU / edge
ollamaollama run falcon3:1b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:1b
Hugging Face ↗ · ollama
DeepSeek-R1-Distill-Qwen-1.5B
1.1 GB
DeepSeek · 1.5B · MIT (distill; base Apache-2.0)
📊 Strong math reasoning for 1.5B (AIME/MATH); CoT traces
~2GB VRAM, runs on CPU
ollamaollama run deepseek-r1:1.5b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:1.5b
Hugging Face ↗ · ollama
SmolLM2-1.7B-Instruct
1.1 GB
Hugging Face · 1.7B · Apache-2.0
📊 Strong tiny on-device chat; good IFEval for size
~2GB VRAM, runs on CPU
ollamaollama run smollm2:1.7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run smollm2:1.7b
Hugging Face ↗ · ollama
Qwen3 1.7B
1.4 GB
Alibaba (Qwen Team) · 1.7B · Apache-2.0
📊 Strong for size on math/reasoning vs Qwen2.5-3B
~2GB VRAM, runs on CPU
ollamaollama run qwen3:1.7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:1.7b
Hugging Face ↗ · ollama
IBM Granite 3.3 2B Instruct
1.5 GB
IBM · 2.5B · Apache-2.0
📊 Compact enterprise model; thinking mode + FIM
~2GB VRAM, runs on CPU
ollamaollama run granite3.3:2b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite3.3:2b
Hugging Face ↗ · ollama
EXAONE 3.5 2.4B Instruct
1.6 GB
LG AI Research · 2.4B · EXAONE AI Model License (non-commercial/research)
📊 Efficient bilingual small model for edge
~2GB VRAM, runs on CPU
ollamaollama run exaone3.5:2.4b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone3.5:2.4b
Hugging Face ↗ · ollama
EXAONE Deep 2.4B
1.6 GB
LG AI Research · 2.4B · EXAONE AI Model License (non-commercial/research)
📊 AIME 2025 47.9; outperforms comparable-size reasoners on math
~2GB VRAM, runs on CPU
ollamaollama run exaone-deep:2.4b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone-deep:2.4b
Hugging Face ↗ · ollama
VibeThinker-3B-GGUF
1.8 GB
prithivMLmods · 3B · mit · discovered
~3GB VRAM, or CPU with 3GB RAM
ollamaollama run hf.co/prithivMLmods/VibeThinker-3B-GGUF
dockerdocker exec -it ollama ollama run hf.co/prithivMLmods/VibeThinker-3B-GGUF
Hugging Face ↗ · ollama
SmolLM3-3B
1.9 GB
Hugging Face · 3B · Apache-2.0
📊 Strong at 3B-4B scale; dual-mode reasoning, 6 languages, long context
~3GB VRAM, runs on CPU
ollamaollama run hf.co/ggml-org/SmolLM3-3B-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ggml-org/SmolLM3-3B-GGUF
Hugging Face ↗ · ollama
Llama 3.2 3B Instruct
2 GB
Meta · 3.21B · Llama 3.2 Community License
📊 MMLU 63.4, IFEval 77.4, GSM8K 77.7, HumanEval ~50
~3GB VRAM, runs on CPU
ollamaollama run llama3.2:3b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.2:3b
Hugging Face ↗ · ollama
Falcon3-3B-Instruct
2 GB
TII (UAE) · 3.2B · TII Falcon-LLM License 2.0
📊 Strong small model via pruning+distillation
~2-3GB VRAM, runs on CPU
ollamaollama run falcon3:3b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:3b
Hugging Face ↗ · ollama
IBM Granite 4.0 Micro (3B)
2.1 GB
IBM · 3B · Apache-2.0
📊 Improved instruction following + tool calling; 12 languages
~2-3GB VRAM, runs on CPU
ollamaollama run granite4:micro
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite4:micro
Hugging Face ↗ · ollama
Phi-3.5-mini-instruct (3.8B)
2.3 GB
Microsoft · 3.8B · MIT
📊 MMLU ~69, strong reasoning for 3.8B, 128K context
~3-4GB VRAM, runs on CPU
ollamaollama run phi3.5
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run phi3.5
Hugging Face ↗ · ollama
NVIDIA-Nemotron-3-Nano-4B-GGUF
2.4 GB
nvidia · 4B · other · discovered
~4GB VRAM, or CPU with 4GB RAM
ollamaollama run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
dockerdocker exec -it ollama ollama run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
Hugging Face ↗ · ollama
Qwen3 4B
2.5 GB
Alibaba (Qwen Team) · 4B · Apache-2.0
📊 Rivals Qwen2.5-72B-Instruct on several tasks (Qwen claim)
~4GB VRAM, runs on CPU
ollamaollama run qwen3:4b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:4b
Hugging Face ↗ · ollama
Phi-4-mini-instruct (3.8B)
2.5 GB
Microsoft · 3.8B · MIT
📊 Strong multilingual + reasoning for 3.8B; function calling
~3-4GB VRAM, runs on CPU
ollamaollama run phi4-mini
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run phi4-mini
Hugging Face ↗ · ollama
Phi-4-mini-reasoning (3.8B)
2.5 GB
Microsoft · 3.8B · MIT
📊 Math-focused; distilled from DeepSeek-R1 synthetic math data
~3-4GB VRAM, runs on CPU
ollamaollama run hf.co/unsloth/Phi-4-mini-reasoning-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/unsloth/Phi-4-mini-reasoning-GGUF
Hugging Face ↗ · ollama
Yi-1.5-6B-Chat
3.6 GB
01.AI · 6B · Apache-2.0
📊 Solid small bilingual chat
~4GB VRAM, runs on CPU
ollamaollama run yi:6b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi:6b
Hugging Face ↗ · ollama
Falcon3-7B-Instruct
4.3 GB
TII (UAE) · 7B · TII Falcon-LLM License 2.0
📊 SOTA-class under 13B at release; strong math/reasoning
~5GB VRAM, runs on CPU
ollamaollama run falcon3:7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:7b
Hugging Face ↗ · ollama
Mistral 7B Instruct v0.3
4.4 GB
Mistral AI · 7.25B · Apache-2.0
📊 MMLU ~62, classic strong 7B baseline, function calling
~5GB VRAM, runs on CPU
ollamaollama run mistral:7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run mistral:7b
Hugging Face ↗ · ollama
DeepSeek-R1-Distill-Qwen-7B
4.7 GB
DeepSeek · 7.6B · MIT (distill; base Apache-2.0)
📊 AIME/MATH strong for 7B; outperforms many non-reasoning models
~6GB VRAM, runs on CPU
ollamaollama run deepseek-r1:7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:7b
Hugging Face ↗ · ollama
Ministral 8B Instruct
4.8 GB
Mistral AI · 8B · Mistral Research License (MRL)
📊 Beats Mistral 7B and Llama 3.1 8B on many tasks; 128K context
~6GB VRAM, runs on CPU
ollamaollama run hf.co/mistralai/Ministral-8B-Instruct-2410
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/mistralai/Ministral-8B-Instruct-2410
Hugging Face ↗ · ollama
EXAONE 3.5 7.8B Instruct
4.8 GB
LG AI Research · 7.8B · EXAONE AI Model License (non-commercial/research)
📊 Strong bilingual EN/KO instruction-following
~6GB VRAM, runs on CPU
ollamaollama run exaone3.5:7.8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone3.5:7.8b
Hugging Face ↗ · ollama
EXAONE Deep 7.8B
4.8 GB
LG AI Research · 7.8B · EXAONE AI Model License (non-commercial/research)
📊 AIME 2025 59.6; strong math/science/coding reasoning for size
~6GB VRAM, runs on CPU
ollamaollama run exaone-deep:7.8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone-deep:7.8b
Hugging Face ↗ · ollama
Bonsai-8B-gguf
4.8 GB
prism-ml · 8B · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/prism-ml/Bonsai-8B-gguf
dockerdocker exec -it ollama ollama run hf.co/prism-ml/Bonsai-8B-gguf
Hugging Face ↗ · ollama
LFM2.5-8B-A1B-GGUF
4.8 GB
LiquidAI · 8B · other · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF
dockerdocker exec -it ollama ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF
Hugging Face ↗ · ollama
Llama 3.1 8B Instruct
4.9 GB
Meta · 8.03B · Llama 3.1 Community License
📊 MMLU 69.4, HumanEval 72.6, GSM8K 84.5, IFEval 80.4
~6GB VRAM, runs on CPU
ollamaollama run llama3.1:8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.1:8b
Hugging Face ↗ · ollama
DeepSeek-R1-Distill-Llama-8B
4.9 GB
DeepSeek · 8B · llama3.1 license (distill MIT)
📊 Strong CoT math/reasoning for 8B
~6GB VRAM, runs on CPU
ollamaollama run deepseek-r1:8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:8b
Hugging Face ↗ · ollama
IBM Granite 3.3 8B Instruct
4.9 GB
IBM · 8.1B · Apache-2.0
📊 Enterprise-tuned; thinking mode, FIM, strong RAG/tool use
~6GB VRAM, runs on CPU
ollamaollama run granite3.3:8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite3.3:8b
Hugging Face ↗ · ollama
InternLM3-8B-Instruct
4.9 GB
Shanghai AI Lab (InternLM) · 8B · Apache-2.0
📊 Surpasses Llama3.1-8B and Qwen2.5-7B on reasoning/knowledge tasks
~6GB VRAM, runs on CPU
ollamaollama run internlm/internlm3-8b-instruct
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run internlm/internlm3-8b-instruct
Hugging Face ↗ · ollama
GLM-5.2-GGUF
5 GB
unsloth · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/unsloth/GLM-5.2-GGUF
dockerdocker exec -it ollama ollama run hf.co/unsloth/GLM-5.2-GGUF
Hugging Face ↗ · ollama
deepseek-v4-gguf
5 GB
antirez · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/antirez/deepseek-v4-gguf
dockerdocker exec -it ollama ollama run hf.co/antirez/deepseek-v4-gguf
Hugging Face ↗ · ollama
Qwable-v1-GGUF
5 GB
lordx64 · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/lordx64/Qwable-v1-GGUF
dockerdocker exec -it ollama ollama run hf.co/lordx64/Qwable-v1-GGUF
Hugging Face ↗ · ollama
supra-title-50M-pre-gguf
5 GB
SupraLabs · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/SupraLabs/supra-title-50M-pre-gguf
dockerdocker exec -it ollama ollama run hf.co/SupraLabs/supra-title-50M-pre-gguf
Hugging Face ↗ · ollama
Supra-1.5-50M-instruct-exp-gguf
5 GB
SupraLabs · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/SupraLabs/Supra-1.5-50M-instruct-exp-gguf
dockerdocker exec -it ollama ollama run hf.co/SupraLabs/Supra-1.5-50M-instruct-exp-gguf
Hugging Face ↗ · ollama
GLM-5.2-REAP50-Q3_K_M-GGUF
5 GB
pipenetwork · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/pipenetwork/GLM-5.2-REAP50-Q3_K_M-GGUF
dockerdocker exec -it ollama ollama run hf.co/pipenetwork/GLM-5.2-REAP50-Q3_K_M-GGUF
Hugging Face ↗ · ollama
Z-Image-Engineer-V6-GGUF
5 GB
BennyDaBall · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/BennyDaBall/Z-Image-Engineer-V6-GGUF
dockerdocker exec -it ollama ollama run hf.co/BennyDaBall/Z-Image-Engineer-V6-GGUF
Hugging Face ↗ · ollama
GLM-4.7-Flash-GGUF
5 GB
unsloth · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/unsloth/GLM-4.7-Flash-GGUF
dockerdocker exec -it ollama ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF
Hugging Face ↗ · ollama
Cohere Command-R7B
5.1 GB
Cohere · 7B · CC-BY-NC 4.0 (non-commercial) + C4AI Acceptable Use
📊 Top-tier speed/quality for 7B; excels at RAG, tool use, agents; 23 languages
~5GB VRAM, runs on CPU / edge
ollamaollama run command-r7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run command-r7b
Hugging Face ↗ · ollama
Qwen3 8B
5.2 GB
Alibaba (Qwen Team) · 8.2B · Apache-2.0
📊 MMLU ~77, strong math/code; hybrid thinking
~6GB VRAM, runs on CPU
ollamaollama run qwen3:8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:8b
Hugging Face ↗ · ollama
Yi-1.5-9B-Chat
5.3 GB
01.AI · 8.8B · Apache-2.0
📊 Strong bilingual (EN/ZH) chat; competitive ~9B coding/math
~6-7GB VRAM, runs on CPU
ollamaollama run yi:9b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi:9b
Hugging Face ↗ · ollama
GLM-4-9B-Chat
5.7 GB
Zhipu AI / Z.ai (THUDM) · 9.4B · GLM-4 License (free for many uses; check terms)
📊 Beats Llama-3-8B on semantics/math/reasoning/code/knowledge; 26 languages
~7GB VRAM, runs on CPU
ollamaollama run glm4:9b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run glm4:9b
Hugging Face ↗ · ollama
Gemma 2 9B Instruct
5.8 GB
Google DeepMind · 9.2B · Gemma Terms of Use
📊 MMLU ~71, beat Llama-3-8B on many tasks at release
~6-7GB VRAM, runs on CPU
ollamaollama run gemma2:9b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma2:9b
Hugging Face ↗ · ollama
Falcon3-10B-Instruct
6.3 GB
TII (UAE) · 10.3B · TII Falcon-LLM License 2.0
📊 Best-in-class under 13B at release (depth up-scaled from 7B)
~7-8GB VRAM
ollamaollama run falcon3:10b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:10b
Hugging Face ↗ · ollama
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
7.2 GB
yuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
dockerdocker exec -it ollama ollama run hf.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
Hugging Face ↗ · ollama
gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF
7.2 GB
yuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF
dockerdocker exec -it ollama ollama run hf.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF
Hugging Face ↗ · ollama
Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF
7.2 GB
yuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF
dockerdocker exec -it ollama ollama run hf.co/yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF
Hugging Face ↗ · ollama
Gwimi-4-12B-IT-GGUF
7.2 GB
trjxter · 12B · gemma · discovered
~10GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/trjxter/Gwimi-4-12B-IT-GGUF
dockerdocker exec -it ollama ollama run hf.co/trjxter/Gwimi-4-12B-IT-GGUF
Hugging Face ↗ · ollama
Qwen3.6-14B-A3B-FableVibes-GGUF
8.4 GB
tvall43 · 14B · apache-2.0 · discovered
~11GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/tvall43/Qwen3.6-14B-A3B-FableVibes-GGUF
dockerdocker exec -it ollama ollama run hf.co/tvall43/Qwen3.6-14B-A3B-FableVibes-GGUF
Hugging Face ↗ · ollama
Qwen3.6-14B-A3B-VibeForged-v2-GGUF
8.4 GB
tvall43 · 14B · apache-2.0 · discovered
~11GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/tvall43/Qwen3.6-14B-A3B-VibeForged-v2-GGUF
dockerdocker exec -it ollama ollama run hf.co/tvall43/Qwen3.6-14B-A3B-VibeForged-v2-GGUF
Hugging Face ↗ · ollama
DeepSeek-R1-Distill-Qwen-14B
9 GB
DeepSeek · 14.8B · MIT (distill; base Apache-2.0)
📊 Approaches o1-mini on reasoning; strong AIME/MATH
~10-12GB VRAM
ollamaollama run deepseek-r1:14b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:14b
Hugging Face ↗ · ollama
Phi-4 (14B)
9.1 GB
Microsoft · 14.7B · MIT
📊 MMLU 84.8, GPQA 56.1, MATH 80.4, HumanEval 82.6
~10-12GB VRAM
ollamaollama run phi4
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run phi4
Hugging Face ↗ · ollama
Phi-4-reasoning (14B)
9.1 GB
Microsoft · 14.7B · MIT
📊 AIME 2024 75.3, HumanEval+ 92.9, IFEval 83.4, OmniMath 76.6
~10-12GB VRAM
ollamaollama run hf.co/unsloth/Phi-4-reasoning-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/unsloth/Phi-4-reasoning-GGUF
Hugging Face ↗ · ollama
Phi-4-reasoning-plus (14B)
9.1 GB
Microsoft · 14.7B · MIT
📊 AIME 2024 81.3, AIME 2025 82.5, HumanEval+ 92.3, IFEval 84.9, OmniMath 81.9
~10-12GB VRAM
ollamaollama run hf.co/unsloth/Phi-4-reasoning-plus-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/unsloth/Phi-4-reasoning-plus-GGUF
Hugging Face ↗ · ollama
Qwen3 14B
9.3 GB
Alibaba (Qwen Team) · 14.8B · Apache-2.0
📊 GPQA ~60s, strong AIME/LiveCodeBench for size
~10-12GB VRAM
ollamaollama run qwen3:14b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:14b
Hugging Face ↗ · ollama
Qwopus-GLM-18B-Merged-GGUF
10.8 GB
Jackrong · 18B · apache-2.0 · discovered
~13GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/Jackrong/Qwopus-GLM-18B-Merged-GGUF
dockerdocker exec -it ollama ollama run hf.co/Jackrong/Qwopus-GLM-18B-Merged-GGUF
Hugging Face ↗ · ollama
GLM-4.7-Flash-REAP-23B-A3B-GGUF
13.8 GB
unsloth · 23B · mit · discovered
~16GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF
dockerdocker exec -it ollama ollama run hf.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF
Hugging Face ↗ · ollama
gpt-oss-20b
14 GB
OpenAI · 21B · Apache-2.0
📊 ~OpenAI o3-mini level on core reasoning; strong tool use / function calling
~16GB memory (runs on 16GB edge devices)
ollamaollama run gpt-oss:20b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gpt-oss:20b
Hugging Face ↗ · ollama
Mistral Small 3.2 24B Instruct
15 GB
Mistral AI · 24B · Apache-2.0
📊 Comparable to much larger models; improved instruction following, function calling, less repetition vs 3.1
~15-16GB VRAM (fits RTX 4090 / 32GB Mac)
ollamaollama run mistral-small3.2
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run mistral-small3.2
Hugging Face ↗ · ollama
Qwen3.6-27B-MTP-pi-tune-GGUF
16.2 GB
bytkim · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollamaollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF
dockerdocker exec -it ollama ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF
Hugging Face ↗ · ollama
qwen3.6-27b-fable5-lora
16.2 GB
hotdogs · 27B · agpl-3.0 · discovered
~20GB VRAM (24GB GPU)
ollamaollama run hf.co/hotdogs/qwen3.6-27b-fable5-lora
dockerdocker exec -it ollama ollama run hf.co/hotdogs/qwen3.6-27b-fable5-lora
Hugging Face ↗ · ollama
Qwen3.6-27B-MTP-TQ3_4S
16.2 GB
YTan2000 · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollamaollama run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S
dockerdocker exec -it ollama ollama run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S
Hugging Face ↗ · ollama
qwen3.6-27b-cybersecurity-lora
16.2 GB
hotdogs · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollamaollama run hf.co/hotdogs/qwen3.6-27b-cybersecurity-lora
dockerdocker exec -it ollama ollama run hf.co/hotdogs/qwen3.6-27b-cybersecurity-lora
Hugging Face ↗ · ollama
Qwen3.6-27B-IQ4_XS-pure-with-MTP-GGUF
16.2 GB
GianniDPC · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollamaollama run hf.co/GianniDPC/Qwen3.6-27B-IQ4_XS-pure-with-MTP-GGUF
dockerdocker exec -it ollama ollama run hf.co/GianniDPC/Qwen3.6-27B-IQ4_XS-pure-with-MTP-GGUF
Hugging Face ↗ · ollama
Gemma 2 27B Instruct
16.5 GB
Google DeepMind · 27.2B · Gemma Terms of Use
📊 MMLU ~75, strong text chat at release
~17GB VRAM
ollamaollama run gemma2:27b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma2:27b
Hugging Face ↗ · ollama
Qwen3.6 27B
17 GB
Alibaba (Qwen Team) · 27B · Apache-2.0
📊 Flagship-level coding in a 27B dense model (Qwen3.6 release); strong agentic coding + thinking preservation
~17GB VRAM (fits 24GB card)
ollamaollama run qwen3.6:27b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3.6:27b
Hugging Face ↗ · ollama
Gemma-4-31B-JANG_4M-CRACK-GGUF
18.6 GB
douyamv · 31B · gemma · discovered
~22GB VRAM (24GB GPU)
ollamaollama run hf.co/douyamv/Gemma-4-31B-JANG_4M-CRACK-GGUF
dockerdocker exec -it ollama ollama run hf.co/douyamv/Gemma-4-31B-JANG_4M-CRACK-GGUF
Hugging Face ↗ · ollama
Qwen3 30B-A3B (MoE)
19 GB
Alibaba (Qwen Team) · 30.5B · Apache-2.0
📊 MMLU-Redux 89.3, GPQA 70.4, AIME25 70.9, LiveCodeBench v5 62.6 (2507 update)
~19GB VRAM; only ~3B active so fast even partly on CPU
ollamaollama run qwen3:30b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:30b
Hugging Face ↗ · ollama
IBM Granite 4.0 Small (32B-A9B MoE, hybrid Mamba-2)
19 GB
IBM · 32B · Apache-2.0
📊 Hybrid Mamba-2 + attention; efficient long-context enterprise tasks
~19GB VRAM; MoE ~9B active so memory-efficient
ollamaollama run granite4:small
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite4:small
Hugging Face ↗ · ollama
EXAONE 3.5 32B Instruct
19 GB
LG AI Research · 32B · EXAONE AI Model License (non-commercial/research)
📊 Powerful bilingual EN/KO performance at 32B
~19GB VRAM (fits 24GB card)
ollamaollama run exaone3.5:32b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone3.5:32b
Hugging Face ↗ · ollama
EXAONE Deep 32B
19 GB
LG AI Research · 32B · EXAONE AI Model License (non-commercial/research)
📊 AIME 2024 90.0; matches DeepSeek-R1 (671B) on AIME 2025
~19GB VRAM (fits 24GB card)
ollamaollama run exaone-deep:32b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone-deep:32b
Hugging Face ↗ · ollama
Qwen3 32B
20 GB
Alibaba (Qwen Team) · 32.8B · Apache-2.0
📊 Flagship dense Qwen3; competitive with much larger models on reasoning/code
~20GB VRAM (fits 24GB card)
ollamaollama run qwen3:32b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:32b
Hugging Face ↗ · ollama
DeepSeek-R1-Distill-Qwen-32B
20 GB
DeepSeek · 32.8B · MIT (distill; base Apache-2.0)
📊 Outperforms OpenAI o1-mini; SOTA dense reasoning at release
~20GB VRAM (fits 24GB card)
ollamaollama run deepseek-r1:32b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:32b
Hugging Face ↗ · ollama
QwQ-32B
20 GB
Alibaba (Qwen Team) · 32.5B · Apache-2.0
📊 Competitive with DeepSeek-R1 on math/reasoning despite 32B size
~20GB VRAM (fits 24GB card)
ollamaollama run qwq
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwq
Hugging Face ↗ · ollama
Yi-1.5-34B-Chat
20.6 GB
01.AI · 34.4B · Apache-2.0
📊 Competitive with much larger models on chat/reasoning at release
~21GB VRAM (fits 24GB card)
ollamaollama run yi:34b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi:34b
Hugging Face ↗ · ollama
SIQ-1-35B
21 GB
AlexWortega · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
ollamaollama run hf.co/AlexWortega/SIQ-1-35B
dockerdocker exec -it ollama ollama run hf.co/AlexWortega/SIQ-1-35B
Hugging Face ↗ · ollama
Qwen3.6-35B-A3B-REAP-90pct-GGUF
21 GB
DJLougen · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
ollamaollama run hf.co/DJLougen/Qwen3.6-35B-A3B-REAP-90pct-GGUF
dockerdocker exec -it ollama ollama run hf.co/DJLougen/Qwen3.6-35B-A3B-REAP-90pct-GGUF
Hugging Face ↗ · ollama
⌨ Coding 35
Qwen2.5-Coder-0.5B-Instruct
0.4 GB
Alibaba (Qwen) · 0.5B · Apache-2.0
📊 HumanEval 61.6, MBPP 52.4
~1GB VRAM, runs easily on CPU
ollamaollama run qwen2.5-coder:0.5b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:0.5b
Hugging Face ↗ · ollama
DeepSeek-Coder-1.3B-Instruct
0.8 GB
DeepSeek · 1.3B · DeepSeek License (permits commercial use)
📊 HumanEval 65.2, MBPP 49.4
~1-2GB VRAM, runs on CPU
ollamaollama run deepseek-coder:1.3b-instruct
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder:1.3b-instruct
Hugging Face ↗ · ollama
Yi-Coder-1.5B-Chat
0.9 GB
01.AI · 1.5B · Apache-2.0
📊 HumanEval ~41.5, LiveCodeBench ~12; 52 languages, 128K context
~2GB VRAM, runs on CPU
ollamaollama run yi-coder:1.5b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi-coder:1.5b
Hugging Face ↗ · ollama
Qwen2.5-Coder-1.5B-Instruct
1 GB
Alibaba (Qwen) · 1.5B · Apache-2.0
📊 HumanEval 70.7, MBPP 69.2
~2GB VRAM, runs on CPU
ollamaollama run qwen2.5-coder:1.5b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:1.5b
Hugging Face ↗ · ollama
OpenCoder-1.5B-Instruct
1 GB
INF (infly) / OpenCoder team · 1.5B · INF Open-Source License (commercial use permitted)
📊 HumanEval 72.5 (HumanEval+ 67.7), MBPP 72.7, BigCodeBench 33.3, LiveCodeBench 12.8
~2GB VRAM, runs on CPU
ollamaollama run hf.co/QuantFactory/OpenCoder-1.5B-Instruct-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/QuantFactory/OpenCoder-1.5B-Instruct-GGUF
Hugging Face ↗ · ollama
CodeGemma-2B
1.6 GB
Google · 2B · Gemma Terms of Use (commercial OK with use restrictions)
📊 HumanEval 31.1, MBPP 43.6 (base, code-completion focused)
~2-3GB VRAM, runs on CPU
ollamaollama run codegemma:2b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codegemma:2b
Hugging Face ↗ · ollama
StarCoder2-3B
1.8 GB
BigCode (ServiceNow/HuggingFace/NVIDIA) · 3B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval ~31.7, strong FIM; trained on The Stack v2 (17 langs, 3T+ tokens)
~2-3GB VRAM, runs on CPU
ollamaollama run starcoder2:3b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run starcoder2:3b
Hugging Face ↗ · ollama
Qwen2.5-Coder-3B-Instruct
1.9 GB
Alibaba (Qwen) · 3B · Qwen Research License (non-commercial)
📊 HumanEval 84.1, MBPP 73.6
~3GB VRAM, runs on CPU
ollamaollama run qwen2.5-coder:3b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:3b
Hugging Face ↗ · ollama
Granite-3B-Code-Instruct-128K
2 GB
IBM · 3B · Apache-2.0
📊 HumanEvalSynthesize ~exceeds CodeLlama-34B-Instruct; enterprise RAG/tool-use tuned
~3GB VRAM, runs on ~4GB RAM / CPU
ollamaollama run granite-code:3b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:3b
Hugging Face ↗ · ollama
Opus4.7-GODs.Ghost.Codex-4B.GGuF
2.4 GB
WithinUsAI · 4B · discovered
~4GB VRAM, or CPU with 4GB RAM
ollamaollama run hf.co/WithinUsAI/Opus4.7-GODs.Ghost.Codex-4B.GGuF
dockerdocker exec -it ollama ollama run hf.co/WithinUsAI/Opus4.7-GODs.Ghost.Codex-4B.GGuF
Hugging Face ↗ · ollama
DeepSeek-Coder-6.7B-Instruct
3.8 GB
DeepSeek · 6.7B · DeepSeek License (permits commercial use)
📊 HumanEval 78.6, MBPP 65.4, DS-1000 strong
~5GB VRAM (6GB+ GPU), runs on CPU
ollamaollama run deepseek-coder:6.7b-instruct
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder:6.7b-instruct
Hugging Face ↗ · ollama
CodeLlama-7B-Instruct
3.8 GB
Meta · 7B · Llama 2 Community License (commercial OK; >700M MAU must request Meta license)
📊 HumanEval 34.8 (instruct), MBPP ~44
~5GB VRAM, runs on CPU
ollamaollama run codellama:7b-instruct
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codellama:7b-instruct
Hugging Face ↗ · ollama
StarCoder2-7B
4 GB
BigCode (ServiceNow/HuggingFace/NVIDIA) · 7B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval 35.4, trained on The Stack v2 (17 langs, 3.5T+ tokens)
~5GB VRAM, runs on CPU (slow)
ollamaollama run starcoder2:7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run starcoder2:7b
Hugging Face ↗ · ollama
Codestral-Mamba-7B (mamba-codestral-7B-v0.1)
4.4 GB
Mistral AI · 7.3B · Apache-2.0
📊 HumanEval 75.0, beats CodeGemma-1.1-7B (61) and DeepSeek-v1.5-7B (66)
~5GB VRAM; linear-time Mamba2 inference scales to long sequences cheaply
ollamaollama run hf.co/Agnuxo/Mamba-Codestral-7B-v0.1-instruct-python_coding_assistant-GGUF_4bit
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/Agnuxo/Mamba-Codestral-7B-v0.1-instruct-python_coding_assistant-GGUF_4bit
Hugging Face ↗ · ollama
Granite-8B-Code-Instruct-128K
4.6 GB
IBM · 8B · Apache-2.0
📊 HumanEvalSynthesize Python 62.2 (avg 51.4), MBPP solid; 116 languages
~5-6GB VRAM (8GB GPU)
ollamaollama run granite-code:8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:8b
Hugging Face ↗ · ollama
Qwen2.5-Coder-7B-Instruct
4.7 GB
Alibaba (Qwen) · 7B · Apache-2.0
📊 HumanEval 88.4, MBPP 83.5, Aider ~57
~6GB VRAM, runs on CPU (slow)
ollamaollama run qwen2.5-coder:7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:7b
Hugging Face ↗ · ollama
OpenCoder-8B-Instruct
4.8 GB
INF (infly) / OpenCoder team · 8B · INF Open-Source License (commercial use permitted)
📊 HumanEval 83.5 (HumanEval+ 78.7), MBPP 79.1, BigCodeBench 40.3, LiveCodeBench 23.2
~6GB VRAM (8GB GPU), runs on CPU (slow)
ollamaollama run hf.co/QuantFactory/OpenCoder-8B-Instruct-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/QuantFactory/OpenCoder-8B-Instruct-GGUF
Hugging Face ↗ · ollama
CodeGemma-7B-it (v1.1)
5 GB
Google · 7B · Gemma Terms of Use (commercial OK with use restrictions)
📊 HumanEval 60.4 (v1.1; 56.1 v1.0), MBPP 55.2
~6GB VRAM, runs on CPU (slow)
ollamaollama run codegemma:7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codegemma:7b
Hugging Face ↗ · ollama
Qwen3-Coder-Next-GGUF
5 GB
unsloth · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollamaollama run hf.co/unsloth/Qwen3-Coder-Next-GGUF
dockerdocker exec -it ollama ollama run hf.co/unsloth/Qwen3-Coder-Next-GGUF
Hugging Face ↗ · ollama
Yi-Coder-9B-Chat
5.4 GB
01.AI · 9B · Apache-2.0
📊 HumanEval 85.4, MBPP 73.8, LiveCodeBench 23.4 (only sub-10B model above 20%)
~6-7GB VRAM (8GB+ GPU)
ollamaollama run yi-coder:9b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi-coder:9b
Hugging Face ↗ · ollama
gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
7.2 GB
yuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
dockerdocker exec -it ollama ollama run hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
Hugging Face ↗ · ollama
RavenX-OpenFable-Coderagent-gemma-4-12B-coder-fable5-composer-Soulinfused-Remastered-GGUF
7.2 GB
deadbydawn101 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollamaollama run hf.co/deadbydawn101/RavenX-OpenFable-Coderagent-gemma-4-12B-coder-fable5-composer-Soulinfused-Remastered-GGUF
dockerdocker exec -it ollama ollama run hf.co/deadbydawn101/RavenX-OpenFable-Coderagent-gemma-4-12B-coder-fable5-composer-Soulinfused-Remastered-GGUF
Hugging Face ↗ · ollama
CodeLlama-13B-Instruct
7.4 GB
Meta · 13B · Llama 2 Community License (commercial OK; >700M MAU must request Meta license)
📊 HumanEval 36.0 (base; instruct ~42.7), MBPP ~49
~8GB VRAM (8GB/12GB GPU)
ollamaollama run codellama:13b-instruct
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codellama:13b-instruct
Hugging Face ↗ · ollama
DeepSeek-Coder-V2-Lite-Instruct (16B MoE)
8.9 GB
DeepSeek · 16B · DeepSeek License (permits commercial use)
📊 HumanEval 81.1, MBPP+ 68.8, supports 338 languages
~10-11GB VRAM incl. KV cache (16GB GPU); MoE only activates 2.4B params so it's fast
ollamaollama run deepseek-coder-v2:16b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder-v2:16b
Hugging Face ↗ · ollama
Qwen2.5-Coder-14B-Instruct
9 GB
Alibaba (Qwen) · 14B · Apache-2.0
📊 HumanEval 89.6, MBPP 86.2, Aider ~62
~11GB VRAM (fits 12GB/16GB GPUs)
ollamaollama run qwen2.5-coder:14b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:14b
Hugging Face ↗ · ollama
StarCoder2-15B
9.1 GB
BigCode (ServiceNow/HuggingFace/NVIDIA) · 15B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval 46.3, MBPP ~66; trained on 600+ languages, 4T+ tokens
~9-10GB VRAM (12GB/16GB GPU)
ollamaollama run starcoder2:15b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run starcoder2:15b
Hugging Face ↗ · ollama
StarCoder2-15B-Instruct-v0.1
9.1 GB
BigCode · 15B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval 72.6 (surpasses CodeLlama-70B-Instruct's 72.0); fully self-aligned, no GPT distillation
~9-10GB VRAM (12GB/16GB GPU)
ollamaollama run hf.co/lmstudio-community/starcoder2-15b-instruct-v0.1-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/lmstudio-community/starcoder2-15b-instruct-v0.1-GGUF
Hugging Face ↗ · ollama
Granite-20B-Code-Instruct
12 GB
IBM · 20B · Apache-2.0
📊 HumanEvalSynthesize avg ~mid-30s; outperforms 2x-larger CodeLlama on instruct tasks
~12GB VRAM (16GB GPU)
ollamaollama run granite-code:20b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:20b
Hugging Face ↗ · ollama
Codestral-22B-v0.1
13 GB
Mistral AI · 22.2B · Mistral AI Non-Production License (MNPL) — research/personal only, no production without commercial license
📊 HumanEval 81.1, MBPP 78.2, 80+ languages, native fill-in-the-middle
~13-16GB VRAM (16GB/24GB GPU)
ollamaollama run codestral:22b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codestral:22b
Hugging Face ↗ · ollama
Qwable-5-27B-Coder-GGUF
16.2 GB
DJLougen · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollamaollama run hf.co/DJLougen/Qwable-5-27B-Coder-GGUF
dockerdocker exec -it ollama ollama run hf.co/DJLougen/Qwable-5-27B-Coder-GGUF
Hugging Face ↗ · ollama
Qwen3-Coder-30B-A3B-Instruct-GGUF
18 GB
unsloth · 30B · apache-2.0 · discovered
~21GB VRAM (24GB GPU)
ollamaollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
dockerdocker exec -it ollama ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Hugging Face ↗ · ollama
Granite-34B-Code-Instruct
19 GB
IBM · 34B · Apache-2.0
📊 HumanEvalSynthesize avg 41.9 (best of Granite-Code, near CodeLlama-70B-Instruct's 41.1)
~19GB VRAM (24GB GPU)
ollamaollama run granite-code:34b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:34b
Hugging Face ↗ · ollama
DeepSeek-Coder-33B-Instruct
19 GB
DeepSeek · 33B · DeepSeek License (permits commercial use)
📊 HumanEval 79.3, MBPP 70.0; beats CodeLlama-34B by ~8pts, ~GPT-3.5-turbo level
~19GB VRAM (24GB GPU)
ollamaollama run deepseek-coder:33b-instruct
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder:33b-instruct
Hugging Face ↗ · ollama
CodeLlama-34B-Instruct
19 GB
Meta · 34B · Llama 2 Community License (commercial OK; >700M MAU must request Meta license)
📊 HumanEval 53.7 (base; instruct ~50), on par with original ChatGPT/GPT-3.5
~19GB VRAM (24GB GPU)
ollamaollama run codellama:34b-instruct
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codellama:34b-instruct
Hugging Face ↗ · ollama
Qwen2.5-Coder-32B-Instruct
20 GB
Alibaba (Qwen) · 32B · Apache-2.0
📊 HumanEval 92.7, MBPP 90.2, Aider 73.7, LiveCodeBench 31.4
~20GB VRAM (24GB GPU) or 32GB unified-memory Mac
ollamaollama run qwen2.5-coder:32b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:32b
Hugging Face ↗ · ollama
◉ Vision (VLM) 46
SmolVLM2 256M (Video) Instruct
0.5 GB
Hugging Face · 0.256B · Apache-2.0
📊 MMMU 29.0, DocVQA 58.3, OCRBench 52.6, TextVQA 49.9, Video-MME 33.7; smallest VLM in the world
<1GB VRAM, runs on CPU / in-browser (WebGPU)
transformersollama run hf.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct (GGUF community build) — or use transformers
dockerdocker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForImageTextToText, AutoProcessor; AutoModelForImageTextToText.from_pretrained('HuggingFaceTB/SmolVLM2-256M-Video-Instruct')"
Hugging Face ↗ · transformers
SmolVLM2 500M (Video) Instruct
1 GB
Hugging Face · 0.5B · Apache-2.0
📊 MMMU 33.7, DocVQA 70.5, Video-MME 42.2; near-2B quality at a fraction of size
~1GB VRAM, runs on CPU
transformersollama run hf.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct (GGUF community build) — or use transformers
dockerdocker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForImageTextToText; AutoModelForImageTextToText.from_pretrained('HuggingFaceTB/SmolVLM2-500M-Video-Instruct')"
Hugging Face ↗ · transformers
InternVL3 1B Instruct
1.2 GB
OpenGVLab (Shanghai AI Lab) · 1B · MIT (LLM component: Qwen2.5 license)
📊 MMMU ~43, DocVQA ~88, strong OCR; InternVL3 family tops out at MMMU 72.2 (78B)
~2GB VRAM, runs on CPU
transformersollama run hf.co/mradermacher/InternVL3-1B-GGUF:Q4_K_M (community) — or use transformers/lmdeploy
dockerdocker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL3-1B', trust_remote_code=True)"
Hugging Face ↗ · transformers
InternVL3 2B Instruct
1.6 GB
OpenGVLab (Shanghai AI Lab) · 2B · MIT (LLM component: Qwen2.5 license)
📊 MMMU ~48, DocVQA ~89, ChartQA strong; HallusionBench improved over 2.5
~3GB VRAM, runs on CPU / 8GB laptop
transformersollama run hf.co/mradermacher/InternVL3-2B-GGUF:Q4_K_M (community) — or use transformers/lmdeploy
dockerdocker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL3-2B', trust_remote_code=True)"
Hugging Face ↗ · transformers
Moondream2
1.7 GB
Vikhyat Korrapati (Moondream) · 1.9B · Apache-2.0
📊 VQAv2 78.1, GQA 59.0, TextVQA 44.1, DocVQA (newer builds) ~70; punches at 7B level for size
~2GB VRAM, runs easily on CPU / Raspberry-Pi-class
ollamaollama run moondream
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run moondream
Hugging Face ↗ · ollama
LocateAnything-3B
1.8 GB
nvidia · 3B · other · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model nvidia/LocateAnything-3B
Qwen3-VL 2B Instruct
1.9 GB
Alibaba Qwen · 2B · Apache-2.0
📊 MMMU ~57, strong OCR (32 languages), DocVQA ~92; current-gen (2025) successor to Qwen2.5-VL
~3GB VRAM, runs comfortably on CPU / 8GB laptop
ollamaollama run qwen3-vl:2b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:2b
Hugging Face ↗ · ollama
Qwen3.5-4B
2.4 GB
Qwen · 4B · apache-2.0 · discovered
~4GB VRAM, or CPU with 4GB RAM
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Qwen/Qwen3.5-4B
Phi-3.5-vision Instruct
2.5 GB
Microsoft · 4.2B · MIT
📊 MMMU 43.0, MMBench 81.9, TextVQA 72.0, multi-frame/video summarization; 128K context
~5GB VRAM FP16 (~3GB Q4), runs on CPU
transformersollama run hf.co/SilverFishK/Phi-3.5-vision-instruct-GGUF (community GGUF; vision support varies) — transformers recommended
dockerdocker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForCausalLM, AutoProcessor; AutoModelForCausalLM.from_pretrained('microsoft/Phi-3.5-vision-instruct', trust_remote_code=True)"
Hugging Face ↗ · transformers
InternVL2.5 4B Instruct
2.8 GB
OpenGVLab (Shanghai AI Lab) · 4B · MIT (LLM component: based on Phi-3-mini / Qwen2)
📊 MMMU ~52, DocVQA ~91, OCRBench strong
~4GB VRAM, runs on CPU
transformersollama run hf.co/mradermacher/InternVL2_5-4B-GGUF:Q4_K_M (community) — or use transformers/lmdeploy
dockerdocker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL2_5-4B', trust_remote_code=True)"
Hugging Face ↗ · transformers
Qwen2.5-VL 3B Instruct
3.2 GB
Alibaba Qwen · 3B · Qwen Research License (3B/7B research; non-commercial constraints)
📊 MMMU 53.1, DocVQA ~93, OCRBench strong
~4GB VRAM, runs on CPU / 8GB laptop
ollamaollama run qwen2.5vl:3b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5vl:3b
Hugging Face ↗ · ollama
Gemma 3 4B (vision)
3.3 GB
Google DeepMind · 4B · Gemma Terms of Use
📊 MMMU ~39, DocVQA ~73, TextVQA strong; 128K context, 140+ languages
~4GB VRAM, runs on CPU / 8GB laptop
ollamaollama run gemma3:4b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:4b
Hugging Face ↗ · ollama
Qwen3-VL 4B Instruct
3.3 GB
Alibaba Qwen · 4B · Apache-2.0
📊 MMMU ~63, DocVQA ~94, ChartQA strong, 32-language OCR; current-gen 2025
~5GB VRAM, runs on CPU
ollamaollama run qwen3-vl:4b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:4b
Hugging Face ↗ · ollama
Granite Vision 3.3 2B
3.6 GB
IBM · 2B · Apache-2.0
📊 DocVQA, ChartQA, AI2D, OCRBench rival/beat Llama 3.2 11B Vision & Pixtral 12B on enterprise doc tasks; tuned for visual document understanding
~4GB VRAM, runs on CPU
ollamaollama run ibm/granite3.3-vision:2b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run ibm/granite3.3-vision:2b
Hugging Face ↗ · ollama
LLaVA 1.6 (NeXT) 7B Mistral
4.5 GB
Liu et al. / LLaVA team · 7B · Apache-2.0 (Mistral base)
📊 MMMU 35.3, improved OCR/chart reading vs LLaVA-1.5; dynamic hi-res tiling up to 672x672
~6GB VRAM, runs on CPU
ollamaollama run llava:7b-v1.6-mistral-q4_K_S
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llava:7b-v1.6-mistral-q4_K_S
Hugging Face ↗ · ollama
SmolVLM 2.2B Instruct
4.5 GB
Hugging Face · 2.2B · Apache-2.0
📊 MMMU 42.0, DocVQA 80.0, TextVQA strong, Video-MME 52.1; best memory efficiency in class
~5GB VRAM FP16 (or ~2GB at Q4), runs on CPU
transformersollama run hf.co/HuggingFaceTB/SmolVLM-Instruct (GGUF community build) — or use transformers
dockerdocker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForImageTextToText; AutoModelForImageTextToText.from_pretrained('HuggingFaceTB/SmolVLM-Instruct')"
Hugging Face ↗ · transformers
Qwable-9B-Claude-Fable-5-GGUF
5.4 GB
empero-ai · 9B · apache-2.0 · discovered
~7GB VRAM, or CPU with 9GB RAM
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model empero-ai/Qwable-9B-Claude-Fable-5-GGUF
MiniCPM-V 2.6
5.5 GB
OpenBMB (Tsinghua) · 8B · MiniCPM Model License (free commercial use with registration)
📊 OpenCompass ~65, MMMU ~49, DocVQA ~90, OCRBench ~85 (SOTA among small models); GPT-4V-level multi-image & video
~7GB VRAM, runs on CPU; designed to run on phones
ollamaollama run minicpm-v
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run minicpm-v
Hugging Face ↗ · ollama
InternVL3 8B Instruct
5.5 GB
OpenGVLab (Shanghai AI Lab) · 8B · MIT (LLM component: Qwen2.5 license)
📊 MMMU ~73, DocVQA 92.7, ChartQA/InfoVQA strong, OCRBench ~88; among best open 8B VLMs
~8GB VRAM at Q4 (fits 12GB GPU); ~18GB FP16
vllmollama run hf.co/mradermacher/InternVL3-8B-GGUF:Q4_K_M (community) — or use lmdeploy/vLLM
dockerdocker run --runtime nvidia --gpus all -p 8000:8000 vllm/vllm-openai:latest --model OpenGVLab/InternVL3-8B --trust-remote-code
Qwen2.5-VL 7B Instruct
6 GB
Alibaba Qwen · 7B · Apache-2.0
📊 MMMU 58.6, DocVQA 95.7, ChartQA ~87, OCRBench ~86; beats Llama 3.2 11B Vision on most VQA
~7GB VRAM (fits 12GB GPU), runs on CPU
ollamaollama run qwen2.5vl:7b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5vl:7b
Hugging Face ↗ · ollama
Unlimited-OCR
6 GB
baidu · mit · discovered
~8GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model baidu/Unlimited-OCR
MiniMax-M3
6 GB
MiniMaxAI · other · discovered
~8GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model MiniMaxAI/MiniMax-M3
Kimi-K2.7-Code
6 GB
moonshotai · other · discovered
~8GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model moonshotai/Kimi-K2.7-Code
lift
6 GB
datalab-to · openrail · discovered
~8GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model datalab-to/lift
GLM-OCR
6 GB
zai-org · mit · discovered
~8GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model zai-org/GLM-OCR
Qwen3-VL 8B Instruct
6.1 GB
Alibaba Qwen · 8B · Apache-2.0
📊 MMMU ~69, DocVQA ~95, MathVista strong; Qwen3-VL family scores up to MMMU 80.6 at largest sizes
~8GB VRAM (fits 12GB GPU), runs on CPU slowly
ollamaollama run qwen3-vl:8b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:8b
Hugging Face ↗ · ollama
gemma-4-12b-it-GGUF
7.2 GB
unsloth · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/gemma-4-12b-it-GGUF
Llama 3.2 Vision 11B Instruct
7.8 GB
Meta · 11B · Llama 3.2 Community License (gated; <700M MAU)
📊 MMMU 50.7, DocVQA 88.4, ChartQA ~83, AI2D ~91, VQAv2 ~75
~9GB VRAM (fits 12-16GB GPU), runs on CPU
ollamaollama run llama3.2-vision:11b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.2-vision:11b
Hugging Face ↗ · ollama
LLaVA 1.6 (NeXT) 13B Vicuna
8 GB
Liu et al. / LLaVA team · 13B · LLaMA-2 Community License (Vicuna base) + Apache (LLaVA weights)
📊 MMMU ~36, MMBench ~70, better text-in-image reading than 1.5
~10GB VRAM (fits 12-16GB GPU), runs on CPU
ollamaollama run llava:13b-v1.6-vicuna-q4_K_S
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llava:13b-v1.6-vicuna-q4_K_S
Hugging Face ↗ · ollama
Pixtral 12B (2409)
8 GB
Mistral AI · 12B · Apache-2.0
📊 MMMU 52.5 (CoT), DocVQA 90.7 (ANLS), ChartQA ~82, VQAv2 ~78; strong multi-image
~9GB VRAM at Q4 (fits 12-16GB GPU); ~24GB FP16
vllmollama run hf.co/mradermacher/Pixtral-12B-2409-GGUF:Q4_K_M (community GGUF) — vLLM recommended
dockerdocker run --runtime nvidia --gpus all -p 8000:8000 vllm/vllm-openai:latest --model mistralai/Pixtral-12B-2409 --tokenizer-mode mistral --limit-mm-per-prompt 'image=4'
Gemma 3 12B (vision)
8.1 GB
Google DeepMind · 12B · Gemma Terms of Use
📊 MMMU 50.3, DocVQA 82.3, InfoVQA/ChartQA strong; 128K context
~9GB VRAM (fits 12-16GB GPU), runs on CPU
ollamaollama run gemma3:12b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:12b
Hugging Face ↗ · ollama
diffusiongemma-26B-A4B-it
15.6 GB
google · 26B · apache-2.0 · discovered
~18GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model google/diffusiongemma-26B-A4B-it
gemma-4-26B-A4B-it-qat-GGUF
15.6 GB
unsloth · 26B · apache-2.0 · discovered
~18GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/gemma-4-26B-A4B-it-qat-GGUF
gemma-4-26B-A4B-it-GGUF
15.6 GB
unsloth · 26B · apache-2.0 · discovered
~18GB VRAM (RTX 3090/4090)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/gemma-4-26B-A4B-it-GGUF
Qwopus3.6-27B-Coder-Compat-MTP-GGUF
16.2 GB
Jackrong · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Jackrong/Qwopus3.6-27B-Coder-Compat-MTP-GGUF
Qwopus3.6-27B-Coder-MTP-GGUF
16.2 GB
Jackrong · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF
Qwen3.6-27B-MTP-GGUF
16.2 GB
unsloth · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/Qwen3.6-27B-MTP-GGUF
Qwen3.6-27B-MTP-pi-reasoning-GGUF
16.2 GB
bytkim · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF
Gemma 3 27B (vision)
17 GB
Google DeepMind · 27B · Gemma Terms of Use
📊 MMMU 56.1, DocVQA 85.6, ChartQA/AI2D strong; competitive with much larger models
~18GB VRAM (fits 24GB GPU), CPU possible
ollamaollama run gemma3:27b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:27b
Hugging Face ↗ · ollama
gemma-4-31B-it
18.6 GB
google · 31B · apache-2.0 · discovered
~22GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model google/gemma-4-31B-it
Qwen3-VL 30B-A3B Instruct (MoE)
20 GB
Alibaba Qwen · 30B · Apache-2.0
📊 MMMU ~73-75, DocVQA ~96; MoE with only ~3B active params so runs fast
~22GB VRAM (fits 24GB GPU at Q4); MoE keeps it fast on CPU
ollamaollama run qwen3-vl:30b-a3b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:30b-a3b
Hugging Face ↗ · ollama
LLaVA 1.6 (NeXT) 34B (Yi)
20 GB
Liu et al. / LLaVA team · 34B · Yi License (Apache-like, free commercial with registration)
📊 MMMU ~46, MMBench ~79; strongest LLaVA-NeXT tier
~22GB VRAM at Q4 (fits 24GB GPU)
ollamaollama run llava:34b-v1.6-q4_K_S
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llava:34b-v1.6-q4_K_S
Hugging Face ↗ · ollama
Qwen2.5-VL 32B Instruct
21 GB
Alibaba Qwen · 32B · Apache-2.0
📊 MMMU ~70, DocVQA ~94, MathVista strong; near 72B quality
~23GB VRAM at Q4 (fits 24GB GPU); CPU possible but slow
ollamaollama run qwen2.5vl:32b
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5vl:32b
Hugging Face ↗ · ollama
Qwen3.6-35B-A3B
21 GB
Qwen · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Qwen/Qwen3.6-35B-A3B
Qwen3.6-35B-A3B-MTP-GGUF
21 GB
unsloth · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/Qwen3.6-35B-A3B-MTP-GGUF
Qwen3.6-35B-A3B-StyleTune
21 GB
Gryphe · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Gryphe/Qwen3.6-35B-A3B-StyleTune
🎙 Speech-to-text 45
Moonshine Tiny
0.19 GB
Useful Sensors / Moonshine AI · 0.027B · MIT
📊 27M params; better average WER than Whisper tiny.en while ~5x faster. v2 Tiny hits ~50ms latency (5.8x faster than Whisper Tiny). English. Variable-length input (no fixed 30s padding) = big edge speedup.
<0.5GB; CPU/edge-first (designed for memory-constrained microcontrollers/SBCs)
transformerspip install useful-moonshine; python -c "import moonshine; print(moonshine.transcribe('audio.wav','moonshine/tiny'))" # ONNX: moonshine.transcribe_with_onnx
dockerdocker run -it -v $(pwd):/data python:3.11 bash -c "pip install useful-moonshine && python -c \"import moonshine; print(moonshine.transcribe('/data/audio.wav','moonshine/tiny'))\""
Hugging Face ↗ · transformers
Moonshine Base
0.237 GB
Useful Sensors / Moonshine AI · 0.061B · MIT
📊 61M params, 237MB on disk; beats Whisper base.en on average WER while running much faster on CPU. English. Variable-length encoder avoids Whisper's fixed-window overhead.
<1GB; CPU/edge-first
transformerspip install useful-moonshine; python -c "import moonshine; print(moonshine.transcribe('audio.wav','moonshine/base'))" # ONNX: moonshine.transcribe_with_onnx
dockerdocker run -it -v $(pwd):/data python:3.11 bash -c "pip install useful-moonshine && python -c \"import moonshine; print(moonshine.transcribe('/data/audio.wav','moonshine/base'))\""
Hugging Face ↗ · transformers
NVIDIA Parakeet TDT-CTC 110M
0.46 GB
NVIDIA · 0.11B · CC-BY-4.0
📊 Compact FastConformer hybrid TDT+CTC; competitive English WER for its size, very high RTFx. Good edge/streaming candidate.
~1GB VRAM; can run CPU
transformerspip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m'); print(m.transcribe(['audio.wav'])[0].text)"
dockerdocker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗ · transformers
nemotron-3.5-asr-streaming-0.6b
0.6 GB
nvidia · 0.6B · other · discovered
runs on CPU / any laptop
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: nvidia/nemotron-3.5-asr-streaming-0.6b
Hugging Face ↗ · faster-whisper
ark-asr-0.6b-int8-onnx
0.6 GB
AutoArk-AI · 0.6B · apache-2.0 · discovered
runs on CPU / any laptop
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: AutoArk-AI/ark-asr-0.6b-int8-onnx
Hugging Face ↗ · faster-whisper
nemotron-speech-streaming-en-0.6b
0.6 GB
nvidia · 0.6B · other · discovered
runs on CPU / any laptop
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: nvidia/nemotron-speech-streaming-en-0.6b
Hugging Face ↗ · faster-whisper
Qwen3-ASR-0.6B
0.6 GB
Qwen · 0.6B · apache-2.0 · discovered
runs on CPU / any laptop
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: Qwen/Qwen3-ASR-0.6B
Hugging Face ↗ · faster-whisper
NVIDIA Canary-180m-Flash
0.73 GB
NVIDIA · 0.182B · CC-BY-4.0
📊 >1200 RTFx (extremely fast); 4 languages (en/de/fr/es) ASR + translation. Strong accuracy-per-param for a 182M model. Word-level timestamps.
~1-2GB VRAM; can run CPU
transformerspip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-180m-flash'); print(m.transcribe(['audio.wav'])[0].text)"
dockerdocker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-180m-flash'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗ · transformers
SenseVoice-Small
0.94 GB
FunAudioLLM (Alibaba) · 0.234B · Apache-2.0 (model: 'model-license', code Apache-2.0)
📊 Non-autoregressive; >5x faster than Whisper-Small and ~15x faster than Whisper-Large; latency <80ms. Beats Whisper on Chinese/Cantonese benchmarks (e.g. AISHELL-1). 50+ languages incl. zh/en/yue/ja/ko, plus emotion (SER) + audio-event detection (AED) + ITN.
~1-2GB VRAM; runs well on CPU
transformerspip install funasr; python -c "from funasr import AutoModel; m=AutoModel(model='FunAudioLLM/SenseVoiceSmall',hub='hf'); print(m.generate(input='audio.mp3',language='auto',use_itn=True)[0]['text'])"
dockerdocker run --gpus all -it -v $(pwd):/data registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:latest-cuda python -c "from funasr import AutoModel; m=AutoModel(model='FunAudioLLM/SenseVoiceSmall',hub='hf'); print(m.generate(input='/data/audio.mp3',language='auto',use_itn=True)[0]['text'])"
Hugging Face ↗ · transformers
speaker-diarization-3.1
1 GB
pyannote · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: pyannote/speaker-diarization-3.1
Hugging Face ↗ · faster-whisper
speaker-diarization-community-1
1 GB
pyannote · cc-by-4.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: pyannote/speaker-diarization-community-1
Hugging Face ↗ · faster-whisper
cohere-transcribe-03-2026
1 GB
CohereLabs · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: CohereLabs/cohere-transcribe-03-2026
Hugging Face ↗ · faster-whisper
whisper.cpp
1 GB
ggerganov · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: ggerganov/whisper.cpp
Hugging Face ↗ · faster-whisper
VibeVoice-ASR
1 GB
microsoft · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: microsoft/VibeVoice-ASR
Hugging Face ↗ · faster-whisper
medasr
1 GB
google · other · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: google/medasr
Hugging Face ↗ · faster-whisper
GLM-ASR-Nano-2512
1 GB
zai-org · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: zai-org/GLM-ASR-Nano-2512
Hugging Face ↗ · faster-whisper
Fun-ASR-Nano-2512
1 GB
FunAudioLLM · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: FunAudioLLM/Fun-ASR-Nano-2512
Hugging Face ↗ · faster-whisper
parakeet-cpp-gguf
1 GB
mudler · cc-by-4.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: mudler/parakeet-cpp-gguf
Hugging Face ↗ · faster-whisper
GigaAM-v3
1 GB
ai-sage · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: ai-sage/GigaAM-v3
Hugging Face ↗ · faster-whisper
fastconformer-quran-ar
1 GB
mohammed · cc-by-4.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: mohammed/fastconformer-quran-ar
Hugging Face ↗ · faster-whisper
whisper-hinglish-preview
1 GB
Trelis · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: Trelis/whisper-hinglish-preview
Hugging Face ↗ · faster-whisper
kotoba-whisper-v2.2
1 GB
kotoba-tech · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: kotoba-tech/kotoba-whisper-v2.2
Hugging Face ↗ · faster-whisper
anime-whisper
1 GB
litagin · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: litagin/anime-whisper
Hugging Face ↗ · faster-whisper
wav2vec2-XLS-R 300M (multilingual base)
1.2 GB
Meta (Facebook AI) · 0.3B · Apache-2.0
📊 Pretrained on 436k hrs across 128 languages. Not directly an ASR head — needs fine-tuning (CTC) per language; fine-tuned variants reach competitive multilingual WER (e.g. Common Voice). Foundation for many community ASR models.
~1-2GB VRAM; CPU works for inference
transformers# pretrained-only: fine-tune then run. pip install transformers torch; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','<your-finetuned-xls-r-300m>'); print(p('audio.wav')['text'])"
dockerdocker run --gpus all -it -v $(pwd):/data huggingface/transformers-pytorch-gpu python -c "from transformers import Wav2Vec2Model; m=Wav2Vec2Model.from_pretrained('facebook/wav2vec2-xls-r-300m'); print('loaded')"
Hugging Face ↗ · transformers
wav2vec2 large-960h-lv60-self (English)
1.26 GB
Meta (Facebook AI) · 0.317B · Apache-2.0
📊 1.8% / 3.3% WER on LibriSpeech test-clean / test-other (CTC, self-training on 960h + 53k unlabeled). English-only, no built-in punctuation. With 10 min labeled data still ~4.8/8.2 WER.
~1-2GB VRAM; runs on CPU
transformerspip install transformers torch torchaudio; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','facebook/wav2vec2-large-960h-lv60-self'); print(p('audio.wav')['text'])"
dockerdocker run --gpus all -it -v $(pwd):/data huggingface/transformers-pytorch-gpu python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','facebook/wav2vec2-large-960h-lv60-self'); print(p('/data/audio.wav')['text'])"
Hugging Face ↗ · transformers
Whisper large-v3-turbo
1.5 GB
OpenAI · 0.809B · MIT
📊 ~3-4% WER LibriSpeech test-clean; only 0.3-0.7pt WER worse than large-v2 but ~6-8x faster (4 decoder layers vs 32). 99-language multilingual.
~4-6GB VRAM; q5_0 fits ~2GB; usable on CPU
whisper.cpp./download-ggml-model.sh large-v3-turbo && ./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav
dockerdocker run -it -v $(pwd):/audio ghcr.io/ggml-org/whisper.cpp:main "./build/bin/whisper-cli -m /models/ggml-large-v3-turbo.bin -f /audio/audio.wav"
Hugging Face ↗ · whisper.cpp
Distil-Whisper distil-large-v3
1.5 GB
Hugging Face · 0.756B · MIT
📊 Within 1.5% WER of large-v3 on OOD short-form, within 1% on long-form, +0.1% better on chunked long-form. ~6x faster than large-v3. English-only.
~2-3GB VRAM FP16; CPU usable via CT2/GGML
transformerspip install transformers torch; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','distil-whisper/distil-large-v3',torch_dtype='float16',device='cuda'); print(p('audio.wav')['text'])"
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # set model=Systran/faster-distil-whisper-large-v3 (CTranslate2 build)
Hugging Face ↗ · transformers
Distil-Whisper distil-large-v3.5
1.5 GB
Hugging Face · 0.756B · MIT
📊 Short-form 7.10 WER vs large-v3's 7.14 (slightly better); long-form 10.04 vs 8.82 (a bit worse). ~1.5x faster than large-v3-turbo on long-form. Trained on 98k hrs with patient teacher + SpecAugment.
~2-3GB VRAM FP16; CPU via CT2/ONNX builds
transformerspip install transformers torch; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','distil-whisper/distil-large-v3.5',torch_dtype='float16',device='cuda'); print(p('audio.wav')['text'])"
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # set model=distil-whisper/distil-large-v3.5-ct2
Hugging Face ↗ · transformers
whisper-large-v3
1.6 GB
openai · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: openai/whisper-large-v3
Hugging Face ↗ · faster-whisper
whisper-large-v3-turbo
1.6 GB
openai · mit · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: openai/whisper-large-v3-turbo
Hugging Face ↗ · faster-whisper
kazakh-whisper-large-v3-turbo
1.6 GB
shyngys879 · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: shyngys879/kazakh-whisper-large-v3-turbo
Hugging Face ↗ · faster-whisper
seamless-m4t-v2-large
1.6 GB
facebook · cc-by-nc-4.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: facebook/seamless-m4t-v2-large
Hugging Face ↗ · faster-whisper
faster-whisper large-v3-turbo (CTranslate2)
1.62 GB
SYSTRAN / deepdml / OpenAI weights · 0.809B · MIT
📊 Near large-v3-turbo quality (~3-4% WER LibriSpeech clean) at very high throughput; combines turbo's pruned decoder with CTranslate2 speedups. Sub-second latency feasible.
~1.5-2GB VRAM FP16; int8 <1GB; good on CPU
faster-whisperpip install faster-whisper; python -c "from faster_whisper import WhisperModel; m=WhisperModel('deepdml/faster-whisper-large-v3-turbo-ct2',device='cuda',compute_type='float16'); [print(s.text) for s,_ in [m.transcribe('audio.wav')][0]]"
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # set model=deepdml/faster-whisper-large-v3-turbo-ct2
Hugging Face ↗ · faster-whisper
Qwen3-ASR-1.7B
1.7 GB
Qwen · 1.7B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: Qwen/Qwen3-ASR-1.7B
Hugging Face ↗ · faster-whisper
Kyutai STT 1B (en/fr, streaming)
2 GB
Kyutai · 1B · CC-BY-4.0
📊 Streaming STT with ~0.5s delay + semantic VAD; English & French. Word-level timestamps; robust to noise. Built on Mimi codec + Moshi-style autoregressive decoder. Trained on 2.5M hrs.
~2-4GB VRAM; designed for real-time streaming on GPU
transformerspip install moshi; python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr audio.wav # or use transformers KyutaiSpeechToText
dockerdocker run --gpus all -it -v $(pwd):/data python:3.11 bash -c "pip install moshi && python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr /data/audio.wav"
Hugging Face ↗ · transformers
granite-speech-4.1-2b
2 GB
ibm-granite · 2B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: ibm-granite/granite-speech-4.1-2b
Hugging Face ↗ · faster-whisper
NVIDIA Parakeet TDT 0.6B v2 (English)
2.4 GB
NVIDIA · 0.6B · CC-BY-4.0
📊 Open ASR Leaderboard avg 6.05% WER (was #1 at release, May 2025). LibriSpeech test-clean 1.69%, test-other 3.19%. RTFx >3000 — transcribes ~1hr audio per second on GPU. English-only.
~2-4GB VRAM; needs NVIDIA GPU (CUDA) for best speed; CPU possible but slow
transformerspip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v2'); print(m.transcribe(['audio.wav'])[0].text)"
dockerdocker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v2'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗ · transformers
NVIDIA Parakeet TDT 0.6B v3 (multilingual)
2.4 GB
NVIDIA · 0.6B · CC-BY-4.0
📊 Open ASR Leaderboard avg 6.34% WER. LibriSpeech test-clean 1.93%. Multilingual Fleurs avg 11.97% WER across 25 European languages. RTFx >3000. Auto language detection.
~2-4GB VRAM; NVIDIA GPU recommended
transformerspip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v3'); print(m.transcribe(['audio.wav'])[0].text)"
dockerdocker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v3'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗ · transformers
ARK-ASR-3B
3 GB
AutoArk-AI · 3B · apache-2.0 · discovered
~4GB VRAM, or CPU with 5GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: AutoArk-AI/ARK-ASR-3B
Hugging Face ↗ · faster-whisper
faster-whisper large-v3 (CTranslate2)
3.09 GB
SYSTRAN / OpenAI weights · 1.54B · MIT
📊 Same accuracy as Whisper large-v3 (~1.8-2.7% WER LibriSpeech clean) but up to 4x faster and lower memory via CTranslate2. int8 quant adds speed with minimal WER loss.
~3GB VRAM FP16; int8 ~1.5-2GB; strong CPU performance
faster-whisperpip install faster-whisper; python -c "from faster_whisper import WhisperModel; m=WhisperModel('large-v3',device='cuda',compute_type='float16'); [print(s.text) for s,_ in [m.transcribe('audio.wav')][0]]"
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # OpenAI-compatible /v1/audio/transcriptions, set model=Systran/faster-whisper-large-v3
Hugging Face ↗ · faster-whisper
NVIDIA Canary-1B-Flash
3.5 GB
NVIDIA · 0.883B · CC-BY-4.0
📊 Avg WER ~6.67% on Open ASR Leaderboard; >1000 RTFx (much faster than original Canary-1B). 4 languages (en/de/fr/es) ASR + En<->X translation with optional punctuation/capitalization.
~3-5GB VRAM; NVIDIA GPU recommended
transformerspip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-flash'); print(m.transcribe(['audio.wav'])[0].text)"
dockerdocker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-flash'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗ · transformers
NVIDIA Canary-1B-v2 (multilingual ASR+AST)
4 GB
NVIDIA · 0.978B · CC-BY-4.0
📊 Top-tier on Open ASR Leaderboard (~5.6-6.7% WER region); 25 European languages, ASR + speech translation (X<->En). Encoder-decoder FastConformer + Transformer. Word-level timestamps.
~4-6GB VRAM; NVIDIA GPU recommended
transformerspip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-v2'); print(m.transcribe(['audio.wav'])[0].text)"
dockerdocker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-v2'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗ · transformers
Voxtral-Mini-4B-Realtime-2602
4 GB
mistralai · 4B · apache-2.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
dockerdocker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: mistralai/Voxtral-Mini-4B-Realtime-2602
Hugging Face ↗ · faster-whisper
NVIDIA Parakeet TDT 1.1B
4.5 GB
NVIDIA · 1.1B · CC-BY-4.0
📊 ~6.0-6.5% avg WER region on Open ASR Leaderboard; trained on 64k+ hrs. Larger encoder than 0.6B for marginal accuracy gains. English.
~4-6GB VRAM; NVIDIA GPU recommended
transformerspip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b'); print(m.transcribe(['audio.wav'])[0].text)"
dockerdocker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗ · transformers
Kyutai STT 2.6B (English, high accuracy)
9 GB
Kyutai · 2.6B · CC-BY-4.0
📊 ~6.4% WER; English-only, optimized for max accuracy with a 2.5s delay. Robust in noisy conditions and on audio up to ~2 hours. A H100 can serve ~400 streams in real-time.
~6-9GB VRAM (bf16); GPU recommended; MLX build runs on Apple Silicon
transformerspip install moshi; python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio.wav # or transformers KyutaiSpeechToText
dockerdocker run --gpus all -it -v $(pwd):/data python:3.11 bash -c "pip install moshi && python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en /data/audio.wav"
Hugging Face ↗ · transformers
🔊 Text-to-speech 38
Kitten-TTS Nano 0.1 (int8)
0.025 GB
KittenML · 0.015B · Apache-2.0
📊 No published MOS; positioned as 'SoTA under 25MB'. 24kHz output, 8 voices. Real-time on CPU including phones/Raspberry Pi.
0GB VRAM, CPU-only (runs on phones / <1GB RAM)
transformerspip install kittentts soundfile && python -c "from kittentts import KittenTTS; import soundfile as sf; m=KittenTTS('KittenML/kitten-tts-nano-0.1'); sf.write('out.wav', m.generate('Hello world', voice='expr-voice-2-f'), 24000)"
dockerdocker run -d -p 8000:8000 ghcr.io/devnen/kitten-tts-server:latest # devnen/Kitten-TTS-Server, Web UI + OpenAI-compatible API
Hugging Face ↗ · transformers
Piper (e.g. en_US-lessac-medium)
0.06 GB
Rhasspy / Open Home Foundation · 0.015B · MIT
📊 No formal MOS; VITS-based; ~10x real-time on desktop CPU, real-time on Raspberry Pi 5. Medium voices 22.05kHz, high 22.05kHz.
0GB VRAM, CPU-only by design; tiny RAM footprint
piperpip install piper-tts && echo 'Hello world' | piper -m en_US-lessac-medium.onnx -f out.wav # download voices from https://huggingface.co/rhasspy/piper-voices
dockerdocker run --rm -v $PWD:/data -e PIPER_VOICE=en_US-lessac-medium lscr.io/linuxserver/piper:latest # or rhasspy/wyoming-piper
Hugging Face ↗ · piper
MeloTTS (English v3)
0.21 GB
MyShell.ai + MIT · 0.05B · MIT
📊 VITS-based; CPU real-time capable. No formal MOS published but widely used; clear, natural multilingual speech. ~44.1kHz internal.
~1GB VRAM; fast CPU real-time inference
transformerspip install git+https://github.com/myshell-ai/MeloTTS.git && python -m unidic download && python -c "from melo.api import TTS; t=TTS(language='EN', device='cpu'); t.tts_to_file('Hello world', t.hps.data.spk2id['EN-US'], 'out.wav')"
dockerdocker run -d -p 8888:8888 --gpus all ghcr.io/myshell-ai/melotts:latest # official MeloTTS image with web UI
Hugging Face ↗ · transformers
Kokoro-82M (v1.0)
0.33 GB
hexgrad · 0.082B · Apache-2.0
📊 Was #1 in TTS Spaces Arena (Dec 2024) at only 82M params, beating much larger models on naturalness ELO. ~24kHz. Sub-real-time on CPU, very fast on GPU.
~1GB VRAM; runs comfortably on CPU
transformerspip install -q 'kokoro>=0.9.2' soundfile && python -c "from kokoro import KPipeline; import soundfile as sf; p=KPipeline(lang_code='a'); g=p('Hello world', voice='af_heart'); [sf.write(f'{i}.wav', a, 24000) for i,(_,_,a) in enumerate(g)]"
dockerdocker run -d -p 8880:8880 ghcr.io/remsky/kokoro-fastapi:latest # remsky/Kokoro-FastAPI, OpenAI-compatible /v1/audio/speech
Hugging Face ↗ · transformers
OpenVoice V2
0.4 GB
MyShell.ai + MIT · 0.1B · MIT
📊 Tone-color conversion step <100ms; instant zero-shot voice cloning. Quality inherits from MeloTTS base. No single MOS, but strong cross-lingual cloning fidelity.
~2GB VRAM; runs on CPU
transformerspip install git+https://github.com/myshell-ai/OpenVoice.git && python -c "from openvoice.api import ToneColorConverter" # MeloTTS base + tone-color converter; clone from ~6s reference
dockerdocker run -d -p 8000:8000 --gpus all ghcr.io/myshell-ai/openvoice:v2 # or any python:3.10 image with the repo installed
Hugging Face ↗ · transformers
Inflect-Nano-v1
0.5 GB
owensong · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load owensong/Inflect-Nano-v1 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
MOSS-TTS-Local-Transformer-v1.5
0.5 GB
OpenMOSS-Team · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
OmniVoice
0.5 GB
k2-fsa · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load k2-fsa/OmniVoice (see model card for TTS pipeline)
Hugging Face ↗ · transformers
ZONOS2
0.5 GB
Zyphra · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load Zyphra/ZONOS2 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
VoxCPM2
0.5 GB
openbmb · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load openbmb/VoxCPM2 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
dots.tts-soar
0.5 GB
rednote-hilab · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load rednote-hilab/dots.tts-soar (see model card for TTS pipeline)
Hugging Face ↗ · transformers
supertonic-3
0.5 GB
Supertone · openrail · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load Supertone/supertonic-3 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
s2-pro
0.5 GB
fishaudio · other · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load fishaudio/s2-pro (see model card for TTS pipeline)
Hugging Face ↗ · transformers
GPA-v1.5
0.5 GB
AutoArk-AI · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load AutoArk-AI/GPA-v1.5 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Fun-CosyVoice3-0.5B-2512
0.5 GB
FunAudioLLM · 0.5B · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load FunAudioLLM/Fun-CosyVoice3-0.5B-2512 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
GPA
0.5 GB
AutoArk-AI · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load AutoArk-AI/GPA (see model card for TTS pipeline)
Hugging Face ↗ · transformers
GPA-v1.5-onnx-runtime
0.5 GB
AutoArk-AI · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load AutoArk-AI/GPA-v1.5-onnx-runtime (see model card for TTS pipeline)
Hugging Face ↗ · transformers
MisoTTS
0.5 GB
MisoLabs · other · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load MisoLabs/MisoTTS (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Dramabox
0.5 GB
ResembleAI · other · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load ResembleAI/Dramabox (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Kokoro-Vietnamese
0.5 GB
contextboxai · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load contextboxai/Kokoro-Vietnamese (see model card for TTS pipeline)
Hugging Face ↗ · transformers
MOSS-TTS-v1.5
0.5 GB
OpenMOSS-Team · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load OpenMOSS-Team/MOSS-TTS-v1.5 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
VoiceTut-TTS
0.5 GB
mohammedaly22 · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load mohammedaly22/VoiceTut-TTS (see model card for TTS pipeline)
Hugging Face ↗ · transformers
BlueMagpie-TTS
0.5 GB
OpenFormosa · other · discovered
runs on CPU / any laptop
dockerpip install transformers torch # load OpenFormosa/BlueMagpie-TTS (see model card for TTS pipeline)
Hugging Face ↗ · transformers
StyleTTS 2 (LibriTTS)
0.78 GB
Yinghao Aaron Li (Columbia) / yl4579 · 0.2B · MIT
📊 LJSpeech MOS-N 4.55 vs 4.23 ground-truth (surpasses human recordings single-speaker); matches human on multispeaker VCTK. 24kHz. Diffusion-style prosody.
~2-4GB VRAM; CPU usable but slow
transformerspip install styletts2 && python -c "from styletts2 import tts; t=tts.StyleTTS2(); t.inference('Hello world', output_wav_file='out.wav')" # LJSpeech checkpoint: yl4579/StyleTTS2-LJSpeech
dockerdocker run --rm -v $PWD:/work --gpus all python:3.10 bash -lc 'pip install styletts2 && python -c "from styletts2 import tts; tts.StyleTTS2().inference(\"Hi\", output_wav_file=\"/work/out.wav\")"'
Hugging Face ↗ · transformers
F5-TTS (v1 Base)
1.35 GB
SWivid (Shanghai Jiao Tong Univ.) · 0.336B · CC-BY-NC-4.0 (weights) / Apache-2.0 for OpenF5-TTS
📊 Flow-matching (non-autoregressive, no diffusion) -> fast inference + strong prosody. ~0.15-0.3 RTF on GPU. Excellent zero-shot cloning + code-switching; among the top open cloning models of 2024-25. 24kHz.
~2-4GB VRAM; CPU usable
transformerspip install f5-tts && f5-tts_infer-cli --model F5TTS_v1_Base --ref_audio ref.wav --ref_text 'reference transcript' --gen_text 'Hello world' # or: f5-tts_infer-gradio for web UI
dockerdocker run -d -p 7860:7860 --gpus all ghcr.io/swivid/f5-tts:main # official image, launches Gradio UI
Hugging Face ↗ · transformers
VibeVoice-1.5B
1.5 GB
microsoft · 1.5B · mit · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerpip install transformers torch # load microsoft/VibeVoice-1.5B (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Qwen3-TTS-12Hz-1.7B-CustomVoice
1.7 GB
Qwen · 1.7B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerpip install transformers torch # load Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice (see model card for TTS pipeline)
Hugging Face ↗ · transformers
IndexTTS-2
1.8 GB
IndexTeam · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerpip install transformers torch # load IndexTeam/IndexTTS-2 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Chatterbox (Multilingual v3)
2 GB
Resemble AI · 0.5B · MIT
📊 Vendor blind test: 65.3% preferred Chatterbox-Turbo vs 24.5% ElevenLabs (take with salt). Sub-200ms latency (Turbo ~472ms first chunk, RTF ~0.5). First open model with emotion-exaggeration control. Trained on 500K hrs.
~4-6GB VRAM (0.5B); Turbo 350M lighter; CPU supported
transformerspip install chatterbox-tts && python -c "import torchaudio as ta; from chatterbox.tts import ChatterboxTTS; m=ChatterboxTTS.from_pretrained(device='cuda'); ta.save('out.wav', m.generate('Hello world', audio_prompt_path='ref.wav'), m.sr)"
dockerdocker run -d -p 8004:8004 --gpus all ghcr.io/devnen/chatterbox-tts-server:latest # devnen/Chatterbox-TTS-Server: Web UI + OpenAI-compatible API, CUDA/ROCm/CPU
Hugging Face ↗ · transformers
Sesame CSM-1B
2.1 GB
Sesame AI Labs · 1B · Apache-2.0
📊 Conversational/contextual prosody (uses prior turns of text+audio). Llama backbone + Mimi RVQ decoder. ~200ms-class streaming. Strong context-aware naturalness; no single MOS published.
~4-6GB VRAM (bf16); GGUF runs smaller / CPU
transformerspip install transformers torch soundfile && python -c "from transformers import CsmForConditionalGeneration, AutoProcessor; import torch, soundfile as sf; m=CsmForConditionalGeneration.from_pretrained('sesame/csm-1b'); p=AutoProcessor.from_pretrained('sesame/csm-1b')" # gated: huggingface-cli login first
dockerdocker run --rm --gpus all -v $PWD:/work huggingface/transformers-pytorch-gpu:latest python /work/csm_infer.py # GGUF: ggml-org/sesame-csm-1b-GGUF via llama.cpp
Hugging Face ↗ · transformers
Coqui XTTS-v2
2.1 GB
Coqui (community-maintained) · 0.5B · Coqui Public Model License (CPML)
📊 6-second zero-shot voice cloning, 17 languages, cross-lingual + emotion/style transfer, 24kHz. Long the community favorite for quality cloning; ~150-200ms streaming latency on GPU.
~2-3GB VRAM (FP16); CPU works but slow
transformerspip install coqui-tts && python -c "from TTS.api import TTS; TTS('tts_models/multilingual/multi-dataset/xtts_v2').tts_to_file(text='Hello world', speaker_wav='ref.wav', language='en', file_path='out.wav')" # coqui-tts is the maintained fork of the TTS package
dockerdocker run -d -p 8020:8020 --gpus all ghcr.io/coqui-ai/xtts-streaming-server:latest # official XTTS streaming server, OpenAI-ish API
Hugging Face ↗ · transformers
Orpheus-TTS 3B (finetuned)
2.3 GB
Canopy Labs · 3B · Apache-2.0
📊 Llama-3.2-3B Speech-LLM, trained 100K+ hrs English. ~200ms streaming latency (down to ~100ms with input streaming). Zero-shot cloning + inline emotion tags (<laugh>,<sigh>,<gasp>...). Claims to rival/surpass closed-source naturalness.
Q4 ~3-4GB VRAM; bf16 ~8GB; CPU via GGUF/llama.cpp
ollamaollama run hf.co/isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF # or library mirror: ollama run legraphista/Orpheus:3b-ft-q4_k_m (Ollama emits SNAC audio tokens -> decode with orpheus-speech)
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF # full audio: vllm serve canopylabs/orpheus-3b-0.1-ft
Hugging Face ↗ · ollama
Parler-TTS Mini v1
2.5 GB
Hugging Face · 0.88B · Apache-2.0
📊 Trained on 45K hours. Natural, controllable speech; you steer gender/pitch/pace/reverb/emotion with a natural-language description prompt. No headline MOS but fully reproducible (data+code+weights open).
~4GB VRAM; CPU possible, slow
transformerspip install git+https://github.com/huggingface/parler-tts.git && python -c "from parler_tts import ParlerTTSForConditionalGeneration as M; from transformers import AutoTokenizer; import soundfile as sf; m=M.from_pretrained('parler-tts/parler-tts-mini-v1'); t=AutoTokenizer.from_pretrained('parler-tts/parler-tts-mini-v1')" # describe voice via text prompt
dockerdocker run --rm --gpus all -v $PWD:/work huggingface/transformers-pytorch-gpu:latest bash -lc 'pip install git+https://github.com/huggingface/parler-tts.git && python /work/parler.py'
Hugging Face ↗ · transformers
higgs-audio-v3-tts-4b
4 GB
bosonai · 4B · other · discovered
~5GB VRAM, or CPU with 6GB RAM
dockerpip install transformers torch # load bosonai/higgs-audio-v3-tts-4b (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Ming-omni-tts-16.8B-A3B
4 GB
inclusionAI · 16.8B · apache-2.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
dockerpip install transformers torch # load inclusionAI/Ming-omni-tts-16.8B-A3B (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Voxtral-4B-TTS-2603
4 GB
mistralai · 4B · cc-by-nc-4.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
dockerpip install transformers torch # load mistralai/Voxtral-4B-TTS-2603 (see model card for TTS pipeline)
Hugging Face ↗ · transformers
higgs-audio-v3-tts-4b-transformers
4 GB
multimodalart · 4B · other · discovered
~5GB VRAM, or CPU with 6GB RAM
dockerpip install transformers torch # load multimodalart/higgs-audio-v3-tts-4b-transformers (see model card for TTS pipeline)
Hugging Face ↗ · transformers
Dia-1.6B
6.4 GB
Nari Labs · 1.6B · Apache-2.0
📊 Specialized for ultra-realistic multi-speaker DIALOGUE in one pass; handles nonverbals (laughs, coughs, throat-clear). Real-time on enterprise GPUs (~40 tok/s on A4000). 44.1kHz. Audio-conditioned emotion/tone + voice cloning from <=10s clip.
~10GB VRAM full (fits 25GB easily); bf16/int8 lowers it
transformerspip install git+https://github.com/nari-labs/dia.git && python -c "from dia.model import Dia; m=Dia.from_pretrained('nari-labs/Dia-1.6B'); import soundfile as sf; sf.write('out.wav', m.generate('[S1] Hello. [S2] Hi there! (laughs)'), 44100)" # also in HF Transformers (DiaForConditionalGeneration)
dockerdocker run --rm --gpus all -v $PWD:/work huggingface/transformers-pytorch-gpu:latest python /work/dia_infer.py # requires PyTorch 2.0+ / CUDA 12.6
Hugging Face ↗ · transformers
◇ Embeddings & rerank 59
snowflake-arctic-embed-xs (v1)
0.046 GB
Snowflake · 0.022B · Apache-2.0
📊 Smallest Arctic; dim 384
<0.2GB VRAM, runs anywhere incl. CPU/edge
ollamaollama run snowflake-arctic-embed:22m
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:22m
Hugging Face ↗ · ollama
all-MiniLM-L6-v2
0.046 GB
sentence-transformers (UKPLab) · 0.022B · Apache-2.0
📊 MTEB (English v1) ~56.3 avg; the classic fast/CPU baseline
<0.2GB VRAM, extremely fast on CPU
ollamaollama run all-minilm:l6
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull all-minilm
Hugging Face ↗ · ollama
granite-embedding-30m-english
0.063 GB
IBM · 0.03B · Apache-2.0
📊 Fast English retrieval, tiny footprint
<0.2GB VRAM, very fast on CPU
ollamaollama run granite-embedding:30m
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull granite-embedding:30m
Hugging Face ↗ · ollama
GTE-small
0.067 GB
Alibaba-NLP (thenlper) · 0.033B · MIT
📊 MTEB (English v1) ~61.4 avg
<0.3GB VRAM, very fast on CPU
sentence-transformersollama run hf.co/ChristianAzinn/gte-small-gguf
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ChristianAzinn/gte-small-gguf
Hugging Face ↗ · sentence-transformers
e5-small-v2
0.067 GB
Microsoft (intfloat) · 0.033B · MIT
📊 MTEB (English v1) ~59.9 avg
<0.3GB VRAM, very fast on CPU
sentence-transformersollama run hf.co/yixuan-chia/e5-small-v2-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/e5-small-v2-GGUF
Hugging Face ↗ · sentence-transformers
snowflake-arctic-embed-s (v1)
0.067 GB
Snowflake · 0.033B · Apache-2.0
📊 Compact English retrieval, dim 384
<0.3GB VRAM, very fast on CPU
ollamaollama run snowflake-arctic-embed:33m
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:33m
Hugging Face ↗ · ollama
BGE-small-en-v1.5
0.07 GB
BAAI · 0.033B · MIT
📊 MTEB (English v1) ~62.2 avg — punches above its size
<0.3GB VRAM, very fast on CPU
sentence-transformersollama run hf.co/CompendiumLabs/bge-small-en-v1.5-gguf
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/CompendiumLabs/bge-small-en-v1.5-gguf
Hugging Face ↗ · sentence-transformers
paraphrase-multilingual-MiniLM-L12-v2
0.13 GB
sentence-transformers · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
Hugging Face ↗ · sentence-transformers
all-MiniLM-L12-v2
0.13 GB
sentence-transformers · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
Hugging Face ↗ · sentence-transformers
multi-modal-embed-small
0.13 GB
llm-semantic-router · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("llm-semantic-router/multi-modal-embed-small")
Hugging Face ↗ · sentence-transformers
snowflake-arctic-embed-m (v1)
0.219 GB
Snowflake · 0.11B · Apache-2.0
📊 MTEB retrieval ~54.9 nDCG@10 (English)
~0.5GB VRAM, fast on CPU
ollamaollama run snowflake-arctic-embed:110m
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:110m
Hugging Face ↗ · ollama
BGE-base-en-v1.5
0.22 GB
BAAI · 0.109B · MIT
📊 MTEB (English v1) ~63.5 avg
~0.5GB VRAM, fast on CPU
sentence-transformersollama run hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
Hugging Face ↗ · sentence-transformers
e5-base-v2
0.22 GB
Microsoft (intfloat) · 0.109B · MIT
📊 MTEB (English v1) ~61.5 avg
~0.5GB VRAM, fast on CPU
sentence-transformersollama run hf.co/yixuan-chia/e5-base-v2-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/e5-base-v2-GGUF
Hugging Face ↗ · sentence-transformers
all-mpnet-base-v2
0.22 GB
sentence-transformers (UKPLab) · 0.109B · Apache-2.0
📊 MTEB (English v1) ~57.8 avg — long the best general-purpose ST model
~0.5GB VRAM, fast on CPU
sentence-transformersollama run hf.co/sentence-transformers/all-mpnet-base-v2
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/sentence-transformers/all-mpnet-base-v2
Hugging Face ↗ · sentence-transformers
multilingual-e5-small
0.24 GB
Microsoft (intfloat) · 0.118B · MIT
📊 Good multilingual quality for 118M params
<0.4GB VRAM, very fast on CPU
sentence-transformersollama run hf.co/yixuan-chia/multilingual-e5-small-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/multilingual-e5-small-GGUF
Hugging Face ↗ · sentence-transformers
nomic-embed-text-v1.5
0.274 GB
Nomic AI · 0.137B · Apache-2.0
📊 Beats OpenAI text-embedding-ada-002 & 3-small on short+long context; MTEB ~62
~0.5GB VRAM (522MB), runs on CPU
ollamaollama run nomic-embed-text:v1.5
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull nomic-embed-text
Hugging Face ↗ · ollama
GTE-base-en-v1.5
0.28 GB
Alibaba-NLP · 0.137B · Apache-2.0
📊 MTEB (English v1) ~64 avg
~0.6GB VRAM, fast on CPU
sentence-transformersollama run hf.co/ChristianAzinn/gte-base-en-v1.5-gguf
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ChristianAzinn/gte-base-en-v1.5-gguf
Hugging Face ↗ · sentence-transformers
granite-embedding-r2 (english, 149m)
0.3 GB
IBM · 0.149B · Apache-2.0
📊 2025 R2 release; improved retrieval over r1, longer context
~0.5GB VRAM, fast on CPU
sentence-transformersollama run hf.co/ibm-granite/granite-embedding-english-r2
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ibm-granite/granite-embedding-english-r2
Hugging Face ↗ · sentence-transformers
LFM2.5-Embedding-350M
0.5 GB
LiquidAI · other · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("LiquidAI/LFM2.5-Embedding-350M")
Hugging Face ↗ · sentence-transformers
LFM2.5-ColBERT-350M
0.5 GB
LiquidAI · other · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("LiquidAI/LFM2.5-ColBERT-350M")
Hugging Face ↗ · sentence-transformers
LateOn-regularized
0.5 GB
lightonai · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("lightonai/LateOn-regularized")
Hugging Face ↗ · sentence-transformers
ruri-v3-310m
0.5 GB
cl-nagoya · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("cl-nagoya/ruri-v3-310m")
Hugging Face ↗ · sentence-transformers
GTE-ModernColBERT-v1
0.5 GB
lightonai · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("lightonai/GTE-ModernColBERT-v1")
Hugging Face ↗ · sentence-transformers
nomic-embed-text-v1
0.5 GB
nomic-ai · apache-2.0 · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("nomic-ai/nomic-embed-text-v1")
Hugging Face ↗ · sentence-transformers
LFM2-ColBERT-350M
0.5 GB
LiquidAI · other · discovered
runs on CPU / any laptop
dockerpip install sentence-transformers # SentenceTransformer("LiquidAI/LFM2-ColBERT-350M")
Hugging Face ↗ · sentence-transformers
multilingual-e5-base
0.56 GB
Microsoft (intfloat) · 0.278B · MIT
📊 Solid multilingual MTEB, mid-size
~0.6GB VRAM, fast on CPU
sentence-transformersollama run hf.co/yixuan-chia/multilingual-e5-base-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/multilingual-e5-base-GGUF
Hugging Face ↗ · sentence-transformers
jina-reranker-v2-base-multilingual
0.56 GB
Jina AI · 0.278B · CC-BY-NC-4.0 (non-commercial)
📊 Fast multilingual cross-encoder; strong BEIR/MKQA; agentic function-calling rerank
~1GB VRAM, runs on CPU
sentence-transformersollama run hf.co/gpustack/jina-reranker-v2-base-multilingual-GGUF
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model jinaai/jina-reranker-v2-base-multilingual --task score
Hugging Face ↗ · sentence-transformers
granite-embedding-278m-multilingual
0.563 GB
IBM · 0.278B · Apache-2.0
📊 Competitive multilingual retrieval; enterprise/clean-data trained
~0.7GB VRAM, fast on CPU
ollamaollama run granite-embedding:278m
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull granite-embedding:278m
Hugging Face ↗ · ollama
snowflake-arctic-embed-m-v2.0
0.61 GB
Snowflake · 0.305B · Apache-2.0
📊 Strong multilingual retrieval, smaller footprint than L-v2.0
~0.8GB VRAM, fast on CPU
sentence-transformersollama run hf.co/Snowflake/snowflake-arctic-embed-m-v2.0
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/Snowflake/snowflake-arctic-embed-m-v2.0
Hugging Face ↗ · sentence-transformers
EmbeddingGemma-300m
0.62 GB
Google DeepMind · 0.308B · Gemma Terms of Use
📊 Highest-ranked open multilingual embedder under 500M on MMTEB at release (Sep 2025)
~0.6GB VRAM, runs on CPU/mobile
ollamaollama run embeddinggemma
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull embeddinggemma
Hugging Face ↗ · ollama
Qwen3-Embedding-0.6B
0.64 GB
Alibaba Qwen · 0.6B · Apache-2.0
📊 MTEB Multilingual mean 64.33; MTEB-Code strong; instruction-aware
~1GB VRAM, runs easily on CPU
ollamaollama run dengcao/Qwen3-Embedding-0.6B:Q8_0
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run dengcao/Qwen3-Embedding-0.6B:Q8_0
Hugging Face ↗ · ollama
Qwen3-Reranker-0.6B
0.64 GB
Alibaba Qwen · 0.6B · Apache-2.0
📊 Cross-encoder reranker; strong MTEB-R / MIRACL reranking gains; instruction-aware
~1GB VRAM, runs on CPU
transformersollama run hf.co/Mungert/Qwen3-Reranker-0.6B-GGUF
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model Qwen/Qwen3-Reranker-0.6B
Hugging Face ↗ · transformers
snowflake-arctic-embed-l (v1.5)
0.669 GB
Snowflake · 0.335B · Apache-2.0
📊 MTEB retrieval ~55.9 nDCG@10 (English) at release (Apr 2024)
~1.5GB VRAM, runs on CPU
ollamaollama run snowflake-arctic-embed:335m
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:335m
Hugging Face ↗ · ollama
BGE-large-en-v1.5
0.67 GB
BAAI · 0.335B · MIT
📊 MTEB (English v1) ~64.2 avg; long the default RAG baseline
~1.5GB VRAM, runs on CPU
ollamaollama run znbang/bge:large-en-v1.5-f16
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run znbang/bge:large-en-v1.5-f16
Hugging Face ↗ · ollama
e5-large-v2
0.67 GB
Microsoft (intfloat) · 0.335B · MIT
📊 MTEB (English v1) ~62.3 avg
~1.5GB VRAM, runs on CPU
sentence-transformersollama run hf.co/yixuan-chia/e5-large-v2-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/e5-large-v2-GGUF
Hugging Face ↗ · sentence-transformers
mxbai-embed-large-v1
0.67 GB
Mixedbread AI · 0.335B · Apache-2.0
📊 MTEB (English v1) ~64.7 avg — SOTA for BERT-large size at release (Mar 2024), no MTEB-data overlap
~1.5GB VRAM, runs on CPU
ollamaollama run mxbai-embed-large:v1
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull mxbai-embed-large
Hugging Face ↗ · ollama
GTE-large-en-v1.5
0.87 GB
Alibaba-NLP · 0.434B · Apache-2.0
📊 MTEB (English v1) ~65 avg — SOTA in its size class at release
~1.5GB VRAM, runs on CPU
sentence-transformersollama run hf.co/ChristianAzinn/gte-large-en-v1.5-gguf
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ChristianAzinn/gte-large-en-v1.5-gguf
Hugging Face ↗ · sentence-transformers
stella_en_400M_v5
0.87 GB
NovaSearch (dunzhang) · 0.435B · MIT
📊 MTEB (English v1) ~70 avg — top small model; near 1.5B quality
~1.5GB VRAM, runs on CPU
sentence-transformersollama run hf.co/dunzhang/stella_en_400M_v5
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/dunzhang/stella_en_400M_v5
Hugging Face ↗ · sentence-transformers
nomic-embed-text-v2-moe
0.94 GB
Nomic AI · 0.475B · Apache-2.0
📊 Multilingual MoE; competitive multilingual MTEB at ~305M active params
~1GB VRAM, runs on CPU
ollamaollama run nomic-embed-text-v2-moe
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull nomic-embed-text-v2-moe
Hugging Face ↗ · ollama
bge-reranker-base
1.1 GB
BAAI · 0.278B · MIT
📊 XLM-RoBERTa-base cross-encoder; solid CN/EN reranking
~1GB VRAM, runs on CPU
sentence-transformersollama run hf.co/gpustack/bge-reranker-base-GGUF
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model BAAI/bge-reranker-base --task score
Hugging Face ↗ · sentence-transformers
multilingual-e5-large
1.1 GB
Microsoft (intfloat) · 0.56B · MIT
📊 Strong multilingual MTEB; beats BGE-large-en & Cohere multilingual-v3 at release
~1.5GB VRAM, runs on CPU
ollamaollama run hf.co/yixuan-chia/multilingual-e5-large-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/multilingual-e5-large-GGUF
Hugging Face ↗ · ollama
jina-embeddings-v3
1.1 GB
Jina AI · 0.572B · CC-BY-NC-4.0 (non-commercial)
📊 Outperforms OpenAI text-embedding-3-large & Cohere on MTEB multilingual at release (Sep 2024)
~1.5GB VRAM, runs on CPU
sentence-transformersollama run hf.co/gpustack/jina-embeddings-v3-GGUF
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/gpustack/jina-embeddings-v3-GGUF
Hugging Face ↗ · sentence-transformers
BGE-M3
1.2 GB
BAAI · 0.567B · MIT
📊 MIRACL nDCG@10 ~70 (multilingual SOTA at release); strong BEIR; hybrid dense+sparse+ColBERT
~2GB VRAM, runs on CPU
ollamaollama run bge-m3:567m-fp16
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull bge-m3
Hugging Face ↗ · ollama
bge-reranker-v2-m3
1.2 GB
BAAI · 0.568B · Apache-2.0
📊 Multilingual cross-encoder; strong MIRACL/BEIR reranking; lightweight
~2GB VRAM, runs on CPU
sentence-transformersollama run hf.co/gpustack/bge-reranker-v2-m3-GGUF
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model BAAI/bge-reranker-v2-m3 --task score
Hugging Face ↗ · sentence-transformers
snowflake-arctic-embed-l-v2.0
1.2 GB
Snowflake · 0.568B · Apache-2.0
📊 Top BEIR nDCG@10 + strong CLEF/MIRACL multilingual at release (Dec 2024)
~1.5GB VRAM, runs on CPU
ollamaollama run snowflake-arctic-embed2:568m
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull snowflake-arctic-embed2
Hugging Face ↗ · ollama
MoAI-Embedding-0.6B
1.2 GB
BCCard · 0.6B · apache-2.0 · discovered
~3GB VRAM, or CPU with 2GB RAM
dockerpip install sentence-transformers # SentenceTransformer("BCCard/MoAI-Embedding-0.6B")
Hugging Face ↗ · sentence-transformers
plamo-embedding-1b
2 GB
pfnet · 1B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerpip install sentence-transformers # SentenceTransformer("pfnet/plamo-embedding-1b")
Hugging Face ↗ · sentence-transformers
llama-nemotron-embed-vl-1b-v2
2 GB
nvidia · 1B · other · discovered
~3GB VRAM, or CPU with 3GB RAM
dockerpip install sentence-transformers # SentenceTransformer("nvidia/llama-nemotron-embed-vl-1b-v2")
Hugging Face ↗ · sentence-transformers
Qwen3-Embedding-4B
2.5 GB
Alibaba Qwen · 4B · Apache-2.0
📊 MTEB Multilingual mean 69.45; near-SOTA retrieval
~3-6GB VRAM depending on quant
ollamaollama run dengcao/Qwen3-Embedding-4B:Q4_K_M
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run dengcao/Qwen3-Embedding-4B:Q4_K_M
Hugging Face ↗ · ollama
Qwen3-Reranker-4B
2.5 GB
Alibaba Qwen · 4B · Apache-2.0
📊 SOTA-class open reranker; large gains on BEIR/MIRACL/MTEB-R reranking
~3-6GB VRAM
transformersollama run dengcao/Qwen3-Reranker-4B:Q4_K_M
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model Qwen/Qwen3-Reranker-4B
Hugging Face ↗ · transformers
gte-Qwen2-1.5B-instruct
3.1 GB
Alibaba-NLP · 1.5B · Apache-2.0
📊 MTEB ~67 avg; instruction-tuned LLM embedder
~2-4GB VRAM
sentence-transformersollama run rjmalagon/gte-qwen2-1.5b-instruct-embed-f16
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run rjmalagon/gte-qwen2-1.5b-instruct-embed-f16
Hugging Face ↗ · sentence-transformers
mxbai-rerank-large-v2
3.1 GB
Mixedbread AI · 1.5B · Apache-2.0
📊 SOTA-class open reranker (2025); strong BEIR; Qwen2.5-1.5B backbone
~3-4GB VRAM
sentence-transformersollama run hf.co/mixedbread-ai/mxbai-rerank-large-v2-gguf
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model mixedbread-ai/mxbai-rerank-large-v2 --task score
Hugging Face ↗ · sentence-transformers
stella_en_1.5B_v5
3.1 GB
NovaSearch (dunzhang) · 1.5B · MIT
📊 MTEB (English v1) ~71.2 avg — top-tier open English embedder; basis of jasper (MTEB #2)
~3-4GB VRAM
sentence-transformersollama run hf.co/dunzhang/stella_en_1.5B_v5
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/dunzhang/stella_en_1.5B_v5
Hugging Face ↗ · sentence-transformers
Qwen3-VL-Embedding-2B
4 GB
Qwen · 2B · apache-2.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
dockerpip install sentence-transformers # SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
Hugging Face ↗ · sentence-transformers
Qwen3-Embedding-8B
4.7 GB
Alibaba Qwen · 8B · Apache-2.0
📊 MTEB Multilingual mean 70.58 — #1 on MTEB multilingual leaderboard (Jun 5 2025)
~6-9GB VRAM at Q4-Q8; runs on CPU slowly
ollamaollama run dengcao/Qwen3-Embedding-8B:Q4_K_M
dockerdocker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run dengcao/Qwen3-Embedding-8B:Q4_K_M
Hugging Face ↗ · ollama
Qwen3-Reranker-8B
4.7 GB
Alibaba Qwen · 8B · Apache-2.0
📊 Best open reranker quality in Qwen3 series; top BEIR/MIRACL reranking
~6-9GB VRAM at Q4-Q8
vllmollama run hf.co/Mungert/Qwen3-Reranker-8B-GGUF
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model Qwen/Qwen3-Reranker-8B
gte-Qwen2-7B-instruct
4.7 GB
Alibaba-NLP · 7B · Apache-2.0
📊 MTEB ~70 avg — #1 English & Chinese MTEB at release (Jun 2024)
~6-8GB VRAM at Q4; 15GB+ at F16
vllmollama run hf.co/mradermacher/gte-Qwen2-7B-instruct-GGUF:Q4_K_M
dockerdocker run --gpus all -p 8000:8000 vllm/vllm-openai --model Alibaba-NLP/gte-Qwen2-7B-instruct
MoAI-Embedding-4B
8 GB
BCCard · 4B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
dockerpip install sentence-transformers # SentenceTransformer("BCCard/MoAI-Embedding-4B")
Hugging Face ↗ · sentence-transformers
Qwen3-VL-Embedding-8B
16 GB
Qwen · 8B · apache-2.0 · discovered
~19GB VRAM (24GB GPU)
dockerpip install sentence-transformers # SentenceTransformer("Qwen/Qwen3-VL-Embedding-8B")
Hugging Face ↗ · sentence-transformers
Sign in to continue

LLM Switchboard is private — sign in with Authly to access the control room.

Sign in with Authly
← Back to home