Everything here runs on your own hardware. "Under 25 GB" means it fits a single 24 GB consumer GPU (RTX 3090/4090) — or a laptop with enough RAM on CPU. Pick by benchmark, copy the Ollama or Docker command, and run it. Sizes assume Q4_K_M quantization for LLMs.
🧪 Local sandbox checking…
Spin up a temporary Ollama service via Docker and test any model right here — the output runs on your hardware, no API keys needed. Click “Test locally” on any card, or type a model tag.
◎ Reasoning & chat 80
Qwen3 0.6B
0.5 GBAlibaba (Qwen Team) · 0.6B · Apache-2.0
📊 Small-scale; hybrid think/non-think; punches above size on reasoning
<1GB VRAM, runs on CPU / edge
ollama
ollama run qwen3:0.6bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:0.6b
Hugging Face ↗
· ollama
Llama 3.2 1B Instruct
0.8 GBMeta · 1.23B · Llama 3.2 Community License
📊 MMLU 49.3, IFEval 59.5, GSM8K 44.4
~1GB VRAM, runs easily on CPU / phones
ollama
ollama run llama3.2:1bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.2:1b
Hugging Face ↗
· ollama
Gemma 3 1B Instruct
0.8 GBGoogle DeepMind · 1B · Gemma Terms of Use
📊 Text-only; solid small-model chat
<1GB VRAM, runs on CPU / mobile
ollama
ollama run gemma3:1bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:1b
Hugging Face ↗
· ollama
Falcon3-1B-Instruct
1 GBTII (UAE) · 1.7B · TII Falcon-LLM License 2.0
📊 Capable tiny model for size
<1GB VRAM, runs on CPU / edge
ollama
ollama run falcon3:1bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:1b
Hugging Face ↗
· ollama
DeepSeek-R1-Distill-Qwen-1.5B
1.1 GBDeepSeek · 1.5B · MIT (distill; base Apache-2.0)
📊 Strong math reasoning for 1.5B (AIME/MATH); CoT traces
~2GB VRAM, runs on CPU
ollama
ollama run deepseek-r1:1.5bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:1.5b
Hugging Face ↗
· ollama
SmolLM2-1.7B-Instruct
1.1 GBHugging Face · 1.7B · Apache-2.0
📊 Strong tiny on-device chat; good IFEval for size
~2GB VRAM, runs on CPU
ollama
ollama run smollm2:1.7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run smollm2:1.7b
Hugging Face ↗
· ollama
Qwen3 1.7B
1.4 GBAlibaba (Qwen Team) · 1.7B · Apache-2.0
📊 Strong for size on math/reasoning vs Qwen2.5-3B
~2GB VRAM, runs on CPU
ollama
ollama run qwen3:1.7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:1.7b
Hugging Face ↗
· ollama
IBM Granite 3.3 2B Instruct
1.5 GBIBM · 2.5B · Apache-2.0
📊 Compact enterprise model; thinking mode + FIM
~2GB VRAM, runs on CPU
ollama
ollama run granite3.3:2bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite3.3:2b
Hugging Face ↗
· ollama
EXAONE 3.5 2.4B Instruct
1.6 GBLG AI Research · 2.4B · EXAONE AI Model License (non-commercial/research)
📊 Efficient bilingual small model for edge
~2GB VRAM, runs on CPU
ollama
ollama run exaone3.5:2.4bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone3.5:2.4b
Hugging Face ↗
· ollama
EXAONE Deep 2.4B
1.6 GBLG AI Research · 2.4B · EXAONE AI Model License (non-commercial/research)
📊 AIME 2025 47.9; outperforms comparable-size reasoners on math
~2GB VRAM, runs on CPU
ollama
ollama run exaone-deep:2.4bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone-deep:2.4b
Hugging Face ↗
· ollama
VibeThinker-3B-GGUF
1.8 GBprithivMLmods · 3B · mit · discovered
~3GB VRAM, or CPU with 3GB RAM
ollama
ollama run hf.co/prithivMLmods/VibeThinker-3B-GGUFdocker
docker exec -it ollama ollama run hf.co/prithivMLmods/VibeThinker-3B-GGUF
Hugging Face ↗
· ollama
SmolLM3-3B
1.9 GBHugging Face · 3B · Apache-2.0
📊 Strong at 3B-4B scale; dual-mode reasoning, 6 languages, long context
~3GB VRAM, runs on CPU
ollama
ollama run hf.co/ggml-org/SmolLM3-3B-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ggml-org/SmolLM3-3B-GGUF
Hugging Face ↗
· ollama
Llama 3.2 3B Instruct
2 GBMeta · 3.21B · Llama 3.2 Community License
📊 MMLU 63.4, IFEval 77.4, GSM8K 77.7, HumanEval ~50
~3GB VRAM, runs on CPU
ollama
ollama run llama3.2:3bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.2:3b
Hugging Face ↗
· ollama
Falcon3-3B-Instruct
2 GBTII (UAE) · 3.2B · TII Falcon-LLM License 2.0
📊 Strong small model via pruning+distillation
~2-3GB VRAM, runs on CPU
ollama
ollama run falcon3:3bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:3b
Hugging Face ↗
· ollama
IBM Granite 4.0 Micro (3B)
2.1 GBIBM · 3B · Apache-2.0
📊 Improved instruction following + tool calling; 12 languages
~2-3GB VRAM, runs on CPU
ollama
ollama run granite4:microdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite4:micro
Hugging Face ↗
· ollama
Phi-3.5-mini-instruct (3.8B)
2.3 GBMicrosoft · 3.8B · MIT
📊 MMLU ~69, strong reasoning for 3.8B, 128K context
~3-4GB VRAM, runs on CPU
ollama
ollama run phi3.5docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run phi3.5
Hugging Face ↗
· ollama
NVIDIA-Nemotron-3-Nano-4B-GGUF
2.4 GBnvidia · 4B · other · discovered
~4GB VRAM, or CPU with 4GB RAM
ollama
ollama run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUFdocker
docker exec -it ollama ollama run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
Hugging Face ↗
· ollama
Qwen3 4B
2.5 GBAlibaba (Qwen Team) · 4B · Apache-2.0
📊 Rivals Qwen2.5-72B-Instruct on several tasks (Qwen claim)
~4GB VRAM, runs on CPU
ollama
ollama run qwen3:4bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:4b
Hugging Face ↗
· ollama
Phi-4-mini-instruct (3.8B)
2.5 GBMicrosoft · 3.8B · MIT
📊 Strong multilingual + reasoning for 3.8B; function calling
~3-4GB VRAM, runs on CPU
ollama
ollama run phi4-minidocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run phi4-mini
Hugging Face ↗
· ollama
Phi-4-mini-reasoning (3.8B)
2.5 GBMicrosoft · 3.8B · MIT
📊 Math-focused; distilled from DeepSeek-R1 synthetic math data
~3-4GB VRAM, runs on CPU
ollama
ollama run hf.co/unsloth/Phi-4-mini-reasoning-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/unsloth/Phi-4-mini-reasoning-GGUF
Hugging Face ↗
· ollama
Yi-1.5-6B-Chat
3.6 GB01.AI · 6B · Apache-2.0
📊 Solid small bilingual chat
~4GB VRAM, runs on CPU
ollama
ollama run yi:6bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi:6b
Hugging Face ↗
· ollama
Falcon3-7B-Instruct
4.3 GBTII (UAE) · 7B · TII Falcon-LLM License 2.0
📊 SOTA-class under 13B at release; strong math/reasoning
~5GB VRAM, runs on CPU
ollama
ollama run falcon3:7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:7b
Hugging Face ↗
· ollama
Mistral 7B Instruct v0.3
4.4 GBMistral AI · 7.25B · Apache-2.0
📊 MMLU ~62, classic strong 7B baseline, function calling
~5GB VRAM, runs on CPU
ollama
ollama run mistral:7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run mistral:7b
Hugging Face ↗
· ollama
DeepSeek-R1-Distill-Qwen-7B
4.7 GBDeepSeek · 7.6B · MIT (distill; base Apache-2.0)
📊 AIME/MATH strong for 7B; outperforms many non-reasoning models
~6GB VRAM, runs on CPU
ollama
ollama run deepseek-r1:7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:7b
Hugging Face ↗
· ollama
Ministral 8B Instruct
4.8 GBMistral AI · 8B · Mistral Research License (MRL)
📊 Beats Mistral 7B and Llama 3.1 8B on many tasks; 128K context
~6GB VRAM, runs on CPU
ollama
ollama run hf.co/mistralai/Ministral-8B-Instruct-2410docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/mistralai/Ministral-8B-Instruct-2410
Hugging Face ↗
· ollama
EXAONE 3.5 7.8B Instruct
4.8 GBLG AI Research · 7.8B · EXAONE AI Model License (non-commercial/research)
📊 Strong bilingual EN/KO instruction-following
~6GB VRAM, runs on CPU
ollama
ollama run exaone3.5:7.8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone3.5:7.8b
Hugging Face ↗
· ollama
EXAONE Deep 7.8B
4.8 GBLG AI Research · 7.8B · EXAONE AI Model License (non-commercial/research)
📊 AIME 2025 59.6; strong math/science/coding reasoning for size
~6GB VRAM, runs on CPU
ollama
ollama run exaone-deep:7.8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone-deep:7.8b
Hugging Face ↗
· ollama
Bonsai-8B-gguf
4.8 GBprism-ml · 8B · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/prism-ml/Bonsai-8B-ggufdocker
docker exec -it ollama ollama run hf.co/prism-ml/Bonsai-8B-gguf
Hugging Face ↗
· ollama
LFM2.5-8B-A1B-GGUF
4.8 GBLiquidAI · 8B · other · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUFdocker
docker exec -it ollama ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF
Hugging Face ↗
· ollama
Llama 3.1 8B Instruct
4.9 GBMeta · 8.03B · Llama 3.1 Community License
📊 MMLU 69.4, HumanEval 72.6, GSM8K 84.5, IFEval 80.4
~6GB VRAM, runs on CPU
ollama
ollama run llama3.1:8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.1:8b
Hugging Face ↗
· ollama
DeepSeek-R1-Distill-Llama-8B
4.9 GBDeepSeek · 8B · llama3.1 license (distill MIT)
📊 Strong CoT math/reasoning for 8B
~6GB VRAM, runs on CPU
ollama
ollama run deepseek-r1:8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:8b
Hugging Face ↗
· ollama
IBM Granite 3.3 8B Instruct
4.9 GBIBM · 8.1B · Apache-2.0
📊 Enterprise-tuned; thinking mode, FIM, strong RAG/tool use
~6GB VRAM, runs on CPU
ollama
ollama run granite3.3:8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite3.3:8b
Hugging Face ↗
· ollama
InternLM3-8B-Instruct
4.9 GBShanghai AI Lab (InternLM) · 8B · Apache-2.0
📊 Surpasses Llama3.1-8B and Qwen2.5-7B on reasoning/knowledge tasks
~6GB VRAM, runs on CPU
ollama
ollama run internlm/internlm3-8b-instructdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run internlm/internlm3-8b-instruct
Hugging Face ↗
· ollama
GLM-5.2-GGUF
5 GBunsloth · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/unsloth/GLM-5.2-GGUFdocker
docker exec -it ollama ollama run hf.co/unsloth/GLM-5.2-GGUF
Hugging Face ↗
· ollama
deepseek-v4-gguf
5 GBantirez · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/antirez/deepseek-v4-ggufdocker
docker exec -it ollama ollama run hf.co/antirez/deepseek-v4-gguf
Hugging Face ↗
· ollama
Qwable-v1-GGUF
5 GBlordx64 · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/lordx64/Qwable-v1-GGUFdocker
docker exec -it ollama ollama run hf.co/lordx64/Qwable-v1-GGUF
Hugging Face ↗
· ollama
supra-title-50M-pre-gguf
5 GBSupraLabs · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/SupraLabs/supra-title-50M-pre-ggufdocker
docker exec -it ollama ollama run hf.co/SupraLabs/supra-title-50M-pre-gguf
Hugging Face ↗
· ollama
Supra-1.5-50M-instruct-exp-gguf
5 GBSupraLabs · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/SupraLabs/Supra-1.5-50M-instruct-exp-ggufdocker
docker exec -it ollama ollama run hf.co/SupraLabs/Supra-1.5-50M-instruct-exp-gguf
Hugging Face ↗
· ollama
GLM-5.2-REAP50-Q3_K_M-GGUF
5 GBpipenetwork · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/pipenetwork/GLM-5.2-REAP50-Q3_K_M-GGUFdocker
docker exec -it ollama ollama run hf.co/pipenetwork/GLM-5.2-REAP50-Q3_K_M-GGUF
Hugging Face ↗
· ollama
Z-Image-Engineer-V6-GGUF
5 GBBennyDaBall · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/BennyDaBall/Z-Image-Engineer-V6-GGUFdocker
docker exec -it ollama ollama run hf.co/BennyDaBall/Z-Image-Engineer-V6-GGUF
Hugging Face ↗
· ollama
GLM-4.7-Flash-GGUF
5 GBunsloth · mit · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/unsloth/GLM-4.7-Flash-GGUFdocker
docker exec -it ollama ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF
Hugging Face ↗
· ollama
Cohere Command-R7B
5.1 GBCohere · 7B · CC-BY-NC 4.0 (non-commercial) + C4AI Acceptable Use
📊 Top-tier speed/quality for 7B; excels at RAG, tool use, agents; 23 languages
~5GB VRAM, runs on CPU / edge
ollama
ollama run command-r7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run command-r7b
Hugging Face ↗
· ollama
Qwen3 8B
5.2 GBAlibaba (Qwen Team) · 8.2B · Apache-2.0
📊 MMLU ~77, strong math/code; hybrid thinking
~6GB VRAM, runs on CPU
ollama
ollama run qwen3:8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:8b
Hugging Face ↗
· ollama
Yi-1.5-9B-Chat
5.3 GB01.AI · 8.8B · Apache-2.0
📊 Strong bilingual (EN/ZH) chat; competitive ~9B coding/math
~6-7GB VRAM, runs on CPU
ollama
ollama run yi:9bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi:9b
Hugging Face ↗
· ollama
GLM-4-9B-Chat
5.7 GBZhipu AI / Z.ai (THUDM) · 9.4B · GLM-4 License (free for many uses; check terms)
📊 Beats Llama-3-8B on semantics/math/reasoning/code/knowledge; 26 languages
~7GB VRAM, runs on CPU
ollama
ollama run glm4:9bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run glm4:9b
Hugging Face ↗
· ollama
Gemma 2 9B Instruct
5.8 GBGoogle DeepMind · 9.2B · Gemma Terms of Use
📊 MMLU ~71, beat Llama-3-8B on many tasks at release
~6-7GB VRAM, runs on CPU
ollama
ollama run gemma2:9bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma2:9b
Hugging Face ↗
· ollama
Falcon3-10B-Instruct
6.3 GBTII (UAE) · 10.3B · TII Falcon-LLM License 2.0
📊 Best-in-class under 13B at release (depth up-scaled from 7B)
~7-8GB VRAM
ollama
ollama run falcon3:10bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run falcon3:10b
Hugging Face ↗
· ollama
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
7.2 GByuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUFdocker
docker exec -it ollama ollama run hf.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
Hugging Face ↗
· ollama
gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF
7.2 GByuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUFdocker
docker exec -it ollama ollama run hf.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF
Hugging Face ↗
· ollama
Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF
7.2 GByuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUFdocker
docker exec -it ollama ollama run hf.co/yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF
Hugging Face ↗
· ollama
Gwimi-4-12B-IT-GGUF
7.2 GBtrjxter · 12B · gemma · discovered
~10GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/trjxter/Gwimi-4-12B-IT-GGUFdocker
docker exec -it ollama ollama run hf.co/trjxter/Gwimi-4-12B-IT-GGUF
Hugging Face ↗
· ollama
Qwen3.6-14B-A3B-FableVibes-GGUF
8.4 GBtvall43 · 14B · apache-2.0 · discovered
~11GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/tvall43/Qwen3.6-14B-A3B-FableVibes-GGUFdocker
docker exec -it ollama ollama run hf.co/tvall43/Qwen3.6-14B-A3B-FableVibes-GGUF
Hugging Face ↗
· ollama
Qwen3.6-14B-A3B-VibeForged-v2-GGUF
8.4 GBtvall43 · 14B · apache-2.0 · discovered
~11GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/tvall43/Qwen3.6-14B-A3B-VibeForged-v2-GGUFdocker
docker exec -it ollama ollama run hf.co/tvall43/Qwen3.6-14B-A3B-VibeForged-v2-GGUF
Hugging Face ↗
· ollama
DeepSeek-R1-Distill-Qwen-14B
9 GBDeepSeek · 14.8B · MIT (distill; base Apache-2.0)
📊 Approaches o1-mini on reasoning; strong AIME/MATH
~10-12GB VRAM
ollama
ollama run deepseek-r1:14bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:14b
Hugging Face ↗
· ollama
Phi-4 (14B)
9.1 GBMicrosoft · 14.7B · MIT
📊 MMLU 84.8, GPQA 56.1, MATH 80.4, HumanEval 82.6
~10-12GB VRAM
ollama
ollama run phi4docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run phi4
Hugging Face ↗
· ollama
Phi-4-reasoning (14B)
9.1 GBMicrosoft · 14.7B · MIT
📊 AIME 2024 75.3, HumanEval+ 92.9, IFEval 83.4, OmniMath 76.6
~10-12GB VRAM
ollama
ollama run hf.co/unsloth/Phi-4-reasoning-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/unsloth/Phi-4-reasoning-GGUF
Hugging Face ↗
· ollama
Phi-4-reasoning-plus (14B)
9.1 GBMicrosoft · 14.7B · MIT
📊 AIME 2024 81.3, AIME 2025 82.5, HumanEval+ 92.3, IFEval 84.9, OmniMath 81.9
~10-12GB VRAM
ollama
ollama run hf.co/unsloth/Phi-4-reasoning-plus-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/unsloth/Phi-4-reasoning-plus-GGUF
Hugging Face ↗
· ollama
Qwen3 14B
9.3 GBAlibaba (Qwen Team) · 14.8B · Apache-2.0
📊 GPQA ~60s, strong AIME/LiveCodeBench for size
~10-12GB VRAM
ollama
ollama run qwen3:14bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:14b
Hugging Face ↗
· ollama
Qwopus-GLM-18B-Merged-GGUF
10.8 GBJackrong · 18B · apache-2.0 · discovered
~13GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/Jackrong/Qwopus-GLM-18B-Merged-GGUFdocker
docker exec -it ollama ollama run hf.co/Jackrong/Qwopus-GLM-18B-Merged-GGUF
Hugging Face ↗
· ollama
GLM-4.7-Flash-REAP-23B-A3B-GGUF
13.8 GBunsloth · 23B · mit · discovered
~16GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUFdocker
docker exec -it ollama ollama run hf.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF
Hugging Face ↗
· ollama
gpt-oss-20b
14 GBOpenAI · 21B · Apache-2.0
📊 ~OpenAI o3-mini level on core reasoning; strong tool use / function calling
~16GB memory (runs on 16GB edge devices)
ollama
ollama run gpt-oss:20bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gpt-oss:20b
Hugging Face ↗
· ollama
Mistral Small 3.2 24B Instruct
15 GBMistral AI · 24B · Apache-2.0
📊 Comparable to much larger models; improved instruction following, function calling, less repetition vs 3.1
~15-16GB VRAM (fits RTX 4090 / 32GB Mac)
ollama
ollama run mistral-small3.2docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run mistral-small3.2
Hugging Face ↗
· ollama
Qwen3.6-27B-MTP-pi-tune-GGUF
16.2 GBbytkim · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollama
ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUFdocker
docker exec -it ollama ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF
Hugging Face ↗
· ollama
qwen3.6-27b-fable5-lora
16.2 GBhotdogs · 27B · agpl-3.0 · discovered
~20GB VRAM (24GB GPU)
ollama
ollama run hf.co/hotdogs/qwen3.6-27b-fable5-loradocker
docker exec -it ollama ollama run hf.co/hotdogs/qwen3.6-27b-fable5-lora
Hugging Face ↗
· ollama
Qwen3.6-27B-MTP-TQ3_4S
16.2 GBYTan2000 · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollama
ollama run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4Sdocker
docker exec -it ollama ollama run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S
Hugging Face ↗
· ollama
qwen3.6-27b-cybersecurity-lora
16.2 GBhotdogs · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollama
ollama run hf.co/hotdogs/qwen3.6-27b-cybersecurity-loradocker
docker exec -it ollama ollama run hf.co/hotdogs/qwen3.6-27b-cybersecurity-lora
Hugging Face ↗
· ollama
Qwen3.6-27B-IQ4_XS-pure-with-MTP-GGUF
16.2 GBGianniDPC · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollama
ollama run hf.co/GianniDPC/Qwen3.6-27B-IQ4_XS-pure-with-MTP-GGUFdocker
docker exec -it ollama ollama run hf.co/GianniDPC/Qwen3.6-27B-IQ4_XS-pure-with-MTP-GGUF
Hugging Face ↗
· ollama
Gemma 2 27B Instruct
16.5 GBGoogle DeepMind · 27.2B · Gemma Terms of Use
📊 MMLU ~75, strong text chat at release
~17GB VRAM
ollama
ollama run gemma2:27bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma2:27b
Hugging Face ↗
· ollama
Qwen3.6 27B
17 GBAlibaba (Qwen Team) · 27B · Apache-2.0
📊 Flagship-level coding in a 27B dense model (Qwen3.6 release); strong agentic coding + thinking preservation
~17GB VRAM (fits 24GB card)
ollama
ollama run qwen3.6:27bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3.6:27b
Hugging Face ↗
· ollama
Gemma-4-31B-JANG_4M-CRACK-GGUF
18.6 GBdouyamv · 31B · gemma · discovered
~22GB VRAM (24GB GPU)
ollama
ollama run hf.co/douyamv/Gemma-4-31B-JANG_4M-CRACK-GGUFdocker
docker exec -it ollama ollama run hf.co/douyamv/Gemma-4-31B-JANG_4M-CRACK-GGUF
Hugging Face ↗
· ollama
Qwen3 30B-A3B (MoE)
19 GBAlibaba (Qwen Team) · 30.5B · Apache-2.0
📊 MMLU-Redux 89.3, GPQA 70.4, AIME25 70.9, LiveCodeBench v5 62.6 (2507 update)
~19GB VRAM; only ~3B active so fast even partly on CPU
ollama
ollama run qwen3:30bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:30b
Hugging Face ↗
· ollama
IBM Granite 4.0 Small (32B-A9B MoE, hybrid Mamba-2)
19 GBIBM · 32B · Apache-2.0
📊 Hybrid Mamba-2 + attention; efficient long-context enterprise tasks
~19GB VRAM; MoE ~9B active so memory-efficient
ollama
ollama run granite4:smalldocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite4:small
Hugging Face ↗
· ollama
EXAONE 3.5 32B Instruct
19 GBLG AI Research · 32B · EXAONE AI Model License (non-commercial/research)
📊 Powerful bilingual EN/KO performance at 32B
~19GB VRAM (fits 24GB card)
ollama
ollama run exaone3.5:32bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone3.5:32b
Hugging Face ↗
· ollama
EXAONE Deep 32B
19 GBLG AI Research · 32B · EXAONE AI Model License (non-commercial/research)
📊 AIME 2024 90.0; matches DeepSeek-R1 (671B) on AIME 2025
~19GB VRAM (fits 24GB card)
ollama
ollama run exaone-deep:32bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run exaone-deep:32b
Hugging Face ↗
· ollama
Qwen3 32B
20 GBAlibaba (Qwen Team) · 32.8B · Apache-2.0
📊 Flagship dense Qwen3; competitive with much larger models on reasoning/code
~20GB VRAM (fits 24GB card)
ollama
ollama run qwen3:32bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3:32b
Hugging Face ↗
· ollama
DeepSeek-R1-Distill-Qwen-32B
20 GBDeepSeek · 32.8B · MIT (distill; base Apache-2.0)
📊 Outperforms OpenAI o1-mini; SOTA dense reasoning at release
~20GB VRAM (fits 24GB card)
ollama
ollama run deepseek-r1:32bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-r1:32b
Hugging Face ↗
· ollama
QwQ-32B
20 GBAlibaba (Qwen Team) · 32.5B · Apache-2.0
📊 Competitive with DeepSeek-R1 on math/reasoning despite 32B size
~20GB VRAM (fits 24GB card)
ollama
ollama run qwqdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwq
Hugging Face ↗
· ollama
Yi-1.5-34B-Chat
20.6 GB01.AI · 34.4B · Apache-2.0
📊 Competitive with much larger models on chat/reasoning at release
~21GB VRAM (fits 24GB card)
ollama
ollama run yi:34bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi:34b
Hugging Face ↗
· ollama
SIQ-1-35B
21 GBAlexWortega · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
ollama
ollama run hf.co/AlexWortega/SIQ-1-35Bdocker
docker exec -it ollama ollama run hf.co/AlexWortega/SIQ-1-35B
Hugging Face ↗
· ollama
Qwen3.6-35B-A3B-REAP-90pct-GGUF
21 GBDJLougen · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
ollama
ollama run hf.co/DJLougen/Qwen3.6-35B-A3B-REAP-90pct-GGUFdocker
docker exec -it ollama ollama run hf.co/DJLougen/Qwen3.6-35B-A3B-REAP-90pct-GGUF
Hugging Face ↗
· ollama
⌨ Coding 35
Qwen2.5-Coder-0.5B-Instruct
0.4 GBAlibaba (Qwen) · 0.5B · Apache-2.0
📊 HumanEval 61.6, MBPP 52.4
~1GB VRAM, runs easily on CPU
ollama
ollama run qwen2.5-coder:0.5bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:0.5b
Hugging Face ↗
· ollama
DeepSeek-Coder-1.3B-Instruct
0.8 GBDeepSeek · 1.3B · DeepSeek License (permits commercial use)
📊 HumanEval 65.2, MBPP 49.4
~1-2GB VRAM, runs on CPU
ollama
ollama run deepseek-coder:1.3b-instructdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder:1.3b-instruct
Hugging Face ↗
· ollama
Yi-Coder-1.5B-Chat
0.9 GB01.AI · 1.5B · Apache-2.0
📊 HumanEval ~41.5, LiveCodeBench ~12; 52 languages, 128K context
~2GB VRAM, runs on CPU
ollama
ollama run yi-coder:1.5bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi-coder:1.5b
Hugging Face ↗
· ollama
Qwen2.5-Coder-1.5B-Instruct
1 GBAlibaba (Qwen) · 1.5B · Apache-2.0
📊 HumanEval 70.7, MBPP 69.2
~2GB VRAM, runs on CPU
ollama
ollama run qwen2.5-coder:1.5bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:1.5b
Hugging Face ↗
· ollama
OpenCoder-1.5B-Instruct
1 GBINF (infly) / OpenCoder team · 1.5B · INF Open-Source License (commercial use permitted)
📊 HumanEval 72.5 (HumanEval+ 67.7), MBPP 72.7, BigCodeBench 33.3, LiveCodeBench 12.8
~2GB VRAM, runs on CPU
ollama
ollama run hf.co/QuantFactory/OpenCoder-1.5B-Instruct-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/QuantFactory/OpenCoder-1.5B-Instruct-GGUF
Hugging Face ↗
· ollama
CodeGemma-2B
1.6 GBGoogle · 2B · Gemma Terms of Use (commercial OK with use restrictions)
📊 HumanEval 31.1, MBPP 43.6 (base, code-completion focused)
~2-3GB VRAM, runs on CPU
ollama
ollama run codegemma:2bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codegemma:2b
Hugging Face ↗
· ollama
StarCoder2-3B
1.8 GBBigCode (ServiceNow/HuggingFace/NVIDIA) · 3B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval ~31.7, strong FIM; trained on The Stack v2 (17 langs, 3T+ tokens)
~2-3GB VRAM, runs on CPU
ollama
ollama run starcoder2:3bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run starcoder2:3b
Hugging Face ↗
· ollama
Qwen2.5-Coder-3B-Instruct
1.9 GBAlibaba (Qwen) · 3B · Qwen Research License (non-commercial)
📊 HumanEval 84.1, MBPP 73.6
~3GB VRAM, runs on CPU
ollama
ollama run qwen2.5-coder:3bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:3b
Hugging Face ↗
· ollama
Granite-3B-Code-Instruct-128K
2 GBIBM · 3B · Apache-2.0
📊 HumanEvalSynthesize ~exceeds CodeLlama-34B-Instruct; enterprise RAG/tool-use tuned
~3GB VRAM, runs on ~4GB RAM / CPU
ollama
ollama run granite-code:3bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:3b
Hugging Face ↗
· ollama
Opus4.7-GODs.Ghost.Codex-4B.GGuF
2.4 GBWithinUsAI · 4B · discovered
~4GB VRAM, or CPU with 4GB RAM
ollama
ollama run hf.co/WithinUsAI/Opus4.7-GODs.Ghost.Codex-4B.GGuFdocker
docker exec -it ollama ollama run hf.co/WithinUsAI/Opus4.7-GODs.Ghost.Codex-4B.GGuF
Hugging Face ↗
· ollama
DeepSeek-Coder-6.7B-Instruct
3.8 GBDeepSeek · 6.7B · DeepSeek License (permits commercial use)
📊 HumanEval 78.6, MBPP 65.4, DS-1000 strong
~5GB VRAM (6GB+ GPU), runs on CPU
ollama
ollama run deepseek-coder:6.7b-instructdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder:6.7b-instruct
Hugging Face ↗
· ollama
CodeLlama-7B-Instruct
3.8 GBMeta · 7B · Llama 2 Community License (commercial OK; >700M MAU must request Meta license)
📊 HumanEval 34.8 (instruct), MBPP ~44
~5GB VRAM, runs on CPU
ollama
ollama run codellama:7b-instructdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codellama:7b-instruct
Hugging Face ↗
· ollama
StarCoder2-7B
4 GBBigCode (ServiceNow/HuggingFace/NVIDIA) · 7B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval 35.4, trained on The Stack v2 (17 langs, 3.5T+ tokens)
~5GB VRAM, runs on CPU (slow)
ollama
ollama run starcoder2:7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run starcoder2:7b
Hugging Face ↗
· ollama
Codestral-Mamba-7B (mamba-codestral-7B-v0.1)
4.4 GBMistral AI · 7.3B · Apache-2.0
📊 HumanEval 75.0, beats CodeGemma-1.1-7B (61) and DeepSeek-v1.5-7B (66)
~5GB VRAM; linear-time Mamba2 inference scales to long sequences cheaply
ollama
ollama run hf.co/Agnuxo/Mamba-Codestral-7B-v0.1-instruct-python_coding_assistant-GGUF_4bitdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/Agnuxo/Mamba-Codestral-7B-v0.1-instruct-python_coding_assistant-GGUF_4bit
Hugging Face ↗
· ollama
Granite-8B-Code-Instruct-128K
4.6 GBIBM · 8B · Apache-2.0
📊 HumanEvalSynthesize Python 62.2 (avg 51.4), MBPP solid; 116 languages
~5-6GB VRAM (8GB GPU)
ollama
ollama run granite-code:8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:8b
Hugging Face ↗
· ollama
Qwen2.5-Coder-7B-Instruct
4.7 GBAlibaba (Qwen) · 7B · Apache-2.0
📊 HumanEval 88.4, MBPP 83.5, Aider ~57
~6GB VRAM, runs on CPU (slow)
ollama
ollama run qwen2.5-coder:7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:7b
Hugging Face ↗
· ollama
OpenCoder-8B-Instruct
4.8 GBINF (infly) / OpenCoder team · 8B · INF Open-Source License (commercial use permitted)
📊 HumanEval 83.5 (HumanEval+ 78.7), MBPP 79.1, BigCodeBench 40.3, LiveCodeBench 23.2
~6GB VRAM (8GB GPU), runs on CPU (slow)
ollama
ollama run hf.co/QuantFactory/OpenCoder-8B-Instruct-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/QuantFactory/OpenCoder-8B-Instruct-GGUF
Hugging Face ↗
· ollama
CodeGemma-7B-it (v1.1)
5 GBGoogle · 7B · Gemma Terms of Use (commercial OK with use restrictions)
📊 HumanEval 60.4 (v1.1; 56.1 v1.0), MBPP 55.2
~6GB VRAM, runs on CPU (slow)
ollama
ollama run codegemma:7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codegemma:7b
Hugging Face ↗
· ollama
Qwen3-Coder-Next-GGUF
5 GBunsloth · apache-2.0 · discovered
~6GB VRAM, or CPU with 8GB RAM
ollama
ollama run hf.co/unsloth/Qwen3-Coder-Next-GGUFdocker
docker exec -it ollama ollama run hf.co/unsloth/Qwen3-Coder-Next-GGUF
Hugging Face ↗
· ollama
Yi-Coder-9B-Chat
5.4 GB01.AI · 9B · Apache-2.0
📊 HumanEval 85.4, MBPP 73.8, LiveCodeBench 23.4 (only sub-10B model above 20%)
~6-7GB VRAM (8GB+ GPU)
ollama
ollama run yi-coder:9bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run yi-coder:9b
Hugging Face ↗
· ollama
gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
7.2 GByuxinlu1 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUFdocker
docker exec -it ollama ollama run hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
Hugging Face ↗
· ollama
RavenX-OpenFable-Coderagent-gemma-4-12B-coder-fable5-composer-Soulinfused-Remastered-GGUF
7.2 GBdeadbydawn101 · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
ollama
ollama run hf.co/deadbydawn101/RavenX-OpenFable-Coderagent-gemma-4-12B-coder-fable5-composer-Soulinfused-Remastered-GGUFdocker
docker exec -it ollama ollama run hf.co/deadbydawn101/RavenX-OpenFable-Coderagent-gemma-4-12B-coder-fable5-composer-Soulinfused-Remastered-GGUF
Hugging Face ↗
· ollama
CodeLlama-13B-Instruct
7.4 GBMeta · 13B · Llama 2 Community License (commercial OK; >700M MAU must request Meta license)
📊 HumanEval 36.0 (base; instruct ~42.7), MBPP ~49
~8GB VRAM (8GB/12GB GPU)
ollama
ollama run codellama:13b-instructdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codellama:13b-instruct
Hugging Face ↗
· ollama
DeepSeek-Coder-V2-Lite-Instruct (16B MoE)
8.9 GBDeepSeek · 16B · DeepSeek License (permits commercial use)
📊 HumanEval 81.1, MBPP+ 68.8, supports 338 languages
~10-11GB VRAM incl. KV cache (16GB GPU); MoE only activates 2.4B params so it's fast
ollama
ollama run deepseek-coder-v2:16bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder-v2:16b
Hugging Face ↗
· ollama
Qwen2.5-Coder-14B-Instruct
9 GBAlibaba (Qwen) · 14B · Apache-2.0
📊 HumanEval 89.6, MBPP 86.2, Aider ~62
~11GB VRAM (fits 12GB/16GB GPUs)
ollama
ollama run qwen2.5-coder:14bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:14b
Hugging Face ↗
· ollama
StarCoder2-15B
9.1 GBBigCode (ServiceNow/HuggingFace/NVIDIA) · 15B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval 46.3, MBPP ~66; trained on 600+ languages, 4T+ tokens
~9-10GB VRAM (12GB/16GB GPU)
ollama
ollama run starcoder2:15bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run starcoder2:15b
Hugging Face ↗
· ollama
StarCoder2-15B-Instruct-v0.1
9.1 GBBigCode · 15B · BigCode OpenRAIL-M (commercial OK, responsible-use clauses)
📊 HumanEval 72.6 (surpasses CodeLlama-70B-Instruct's 72.0); fully self-aligned, no GPT distillation
~9-10GB VRAM (12GB/16GB GPU)
ollama
ollama run hf.co/lmstudio-community/starcoder2-15b-instruct-v0.1-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/lmstudio-community/starcoder2-15b-instruct-v0.1-GGUF
Hugging Face ↗
· ollama
Granite-20B-Code-Instruct
12 GBIBM · 20B · Apache-2.0
📊 HumanEvalSynthesize avg ~mid-30s; outperforms 2x-larger CodeLlama on instruct tasks
~12GB VRAM (16GB GPU)
ollama
ollama run granite-code:20bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:20b
Hugging Face ↗
· ollama
Codestral-22B-v0.1
13 GBMistral AI · 22.2B · Mistral AI Non-Production License (MNPL) — research/personal only, no production without commercial license
📊 HumanEval 81.1, MBPP 78.2, 80+ languages, native fill-in-the-middle
~13-16GB VRAM (16GB/24GB GPU)
ollama
ollama run codestral:22bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codestral:22b
Hugging Face ↗
· ollama
Qwable-5-27B-Coder-GGUF
16.2 GBDJLougen · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
ollama
ollama run hf.co/DJLougen/Qwable-5-27B-Coder-GGUFdocker
docker exec -it ollama ollama run hf.co/DJLougen/Qwable-5-27B-Coder-GGUF
Hugging Face ↗
· ollama
Qwen3-Coder-30B-A3B-Instruct-GGUF
18 GBunsloth · 30B · apache-2.0 · discovered
~21GB VRAM (24GB GPU)
ollama
ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUFdocker
docker exec -it ollama ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Hugging Face ↗
· ollama
Granite-34B-Code-Instruct
19 GBIBM · 34B · Apache-2.0
📊 HumanEvalSynthesize avg 41.9 (best of Granite-Code, near CodeLlama-70B-Instruct's 41.1)
~19GB VRAM (24GB GPU)
ollama
ollama run granite-code:34bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run granite-code:34b
Hugging Face ↗
· ollama
DeepSeek-Coder-33B-Instruct
19 GBDeepSeek · 33B · DeepSeek License (permits commercial use)
📊 HumanEval 79.3, MBPP 70.0; beats CodeLlama-34B by ~8pts, ~GPT-3.5-turbo level
~19GB VRAM (24GB GPU)
ollama
ollama run deepseek-coder:33b-instructdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run deepseek-coder:33b-instruct
Hugging Face ↗
· ollama
CodeLlama-34B-Instruct
19 GBMeta · 34B · Llama 2 Community License (commercial OK; >700M MAU must request Meta license)
📊 HumanEval 53.7 (base; instruct ~50), on par with original ChatGPT/GPT-3.5
~19GB VRAM (24GB GPU)
ollama
ollama run codellama:34b-instructdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run codellama:34b-instruct
Hugging Face ↗
· ollama
Qwen2.5-Coder-32B-Instruct
20 GBAlibaba (Qwen) · 32B · Apache-2.0
📊 HumanEval 92.7, MBPP 90.2, Aider 73.7, LiveCodeBench 31.4
~20GB VRAM (24GB GPU) or 32GB unified-memory Mac
ollama
ollama run qwen2.5-coder:32bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5-coder:32b
Hugging Face ↗
· ollama
◉ Vision (VLM) 46
SmolVLM2 256M (Video) Instruct
0.5 GBHugging Face · 0.256B · Apache-2.0
📊 MMMU 29.0, DocVQA 58.3, OCRBench 52.6, TextVQA 49.9, Video-MME 33.7; smallest VLM in the world
<1GB VRAM, runs on CPU / in-browser (WebGPU)
transformers
ollama run hf.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct (GGUF community build) — or use transformersdocker
docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForImageTextToText, AutoProcessor; AutoModelForImageTextToText.from_pretrained('HuggingFaceTB/SmolVLM2-256M-Video-Instruct')"
Hugging Face ↗
· transformers
SmolVLM2 500M (Video) Instruct
1 GBHugging Face · 0.5B · Apache-2.0
📊 MMMU 33.7, DocVQA 70.5, Video-MME 42.2; near-2B quality at a fraction of size
~1GB VRAM, runs on CPU
transformers
ollama run hf.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct (GGUF community build) — or use transformersdocker
docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForImageTextToText; AutoModelForImageTextToText.from_pretrained('HuggingFaceTB/SmolVLM2-500M-Video-Instruct')"
Hugging Face ↗
· transformers
InternVL3 1B Instruct
1.2 GBOpenGVLab (Shanghai AI Lab) · 1B · MIT (LLM component: Qwen2.5 license)
📊 MMMU ~43, DocVQA ~88, strong OCR; InternVL3 family tops out at MMMU 72.2 (78B)
~2GB VRAM, runs on CPU
transformers
ollama run hf.co/mradermacher/InternVL3-1B-GGUF:Q4_K_M (community) — or use transformers/lmdeploydocker
docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL3-1B', trust_remote_code=True)"
Hugging Face ↗
· transformers
InternVL3 2B Instruct
1.6 GBOpenGVLab (Shanghai AI Lab) · 2B · MIT (LLM component: Qwen2.5 license)
📊 MMMU ~48, DocVQA ~89, ChartQA strong; HallusionBench improved over 2.5
~3GB VRAM, runs on CPU / 8GB laptop
transformers
ollama run hf.co/mradermacher/InternVL3-2B-GGUF:Q4_K_M (community) — or use transformers/lmdeploydocker
docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL3-2B', trust_remote_code=True)"
Hugging Face ↗
· transformers
Moondream2
1.7 GBVikhyat Korrapati (Moondream) · 1.9B · Apache-2.0
📊 VQAv2 78.1, GQA 59.0, TextVQA 44.1, DocVQA (newer builds) ~70; punches at 7B level for size
~2GB VRAM, runs easily on CPU / Raspberry-Pi-class
ollama
ollama run moondreamdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run moondream
Hugging Face ↗
· ollama
LocateAnything-3B
1.8 GBnvidia · 3B · other · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model nvidia/LocateAnything-3B
Hugging Face ↗
· vllm
Qwen3-VL 2B Instruct
1.9 GBAlibaba Qwen · 2B · Apache-2.0
📊 MMMU ~57, strong OCR (32 languages), DocVQA ~92; current-gen (2025) successor to Qwen2.5-VL
~3GB VRAM, runs comfortably on CPU / 8GB laptop
ollama
ollama run qwen3-vl:2bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:2b
Hugging Face ↗
· ollama
Qwen3.5-4B
2.4 GBQwen · 4B · apache-2.0 · discovered
~4GB VRAM, or CPU with 4GB RAM
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Qwen/Qwen3.5-4B
Hugging Face ↗
· vllm
Phi-3.5-vision Instruct
2.5 GBMicrosoft · 4.2B · MIT
📊 MMMU 43.0, MMBench 81.9, TextVQA 72.0, multi-frame/video summarization; 128K context
~5GB VRAM FP16 (~3GB Q4), runs on CPU
transformers
ollama run hf.co/SilverFishK/Phi-3.5-vision-instruct-GGUF (community GGUF; vision support varies) — transformers recommendeddocker
docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForCausalLM, AutoProcessor; AutoModelForCausalLM.from_pretrained('microsoft/Phi-3.5-vision-instruct', trust_remote_code=True)"
Hugging Face ↗
· transformers
InternVL2.5 4B Instruct
2.8 GBOpenGVLab (Shanghai AI Lab) · 4B · MIT (LLM component: based on Phi-3-mini / Qwen2)
📊 MMMU ~52, DocVQA ~91, OCRBench strong
~4GB VRAM, runs on CPU
transformers
ollama run hf.co/mradermacher/InternVL2_5-4B-GGUF:Q4_K_M (community) — or use transformers/lmdeploydocker
docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL2_5-4B', trust_remote_code=True)"
Hugging Face ↗
· transformers
Qwen2.5-VL 3B Instruct
3.2 GBAlibaba Qwen · 3B · Qwen Research License (3B/7B research; non-commercial constraints)
📊 MMMU 53.1, DocVQA ~93, OCRBench strong
~4GB VRAM, runs on CPU / 8GB laptop
ollama
ollama run qwen2.5vl:3bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5vl:3b
Hugging Face ↗
· ollama
Gemma 3 4B (vision)
3.3 GBGoogle DeepMind · 4B · Gemma Terms of Use
📊 MMMU ~39, DocVQA ~73, TextVQA strong; 128K context, 140+ languages
~4GB VRAM, runs on CPU / 8GB laptop
ollama
ollama run gemma3:4bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:4b
Hugging Face ↗
· ollama
Qwen3-VL 4B Instruct
3.3 GBAlibaba Qwen · 4B · Apache-2.0
📊 MMMU ~63, DocVQA ~94, ChartQA strong, 32-language OCR; current-gen 2025
~5GB VRAM, runs on CPU
ollama
ollama run qwen3-vl:4bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:4b
Hugging Face ↗
· ollama
Granite Vision 3.3 2B
3.6 GBIBM · 2B · Apache-2.0
📊 DocVQA, ChartQA, AI2D, OCRBench rival/beat Llama 3.2 11B Vision & Pixtral 12B on enterprise doc tasks; tuned for visual document understanding
~4GB VRAM, runs on CPU
ollama
ollama run ibm/granite3.3-vision:2bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run ibm/granite3.3-vision:2b
Hugging Face ↗
· ollama
LLaVA 1.6 (NeXT) 7B Mistral
4.5 GBLiu et al. / LLaVA team · 7B · Apache-2.0 (Mistral base)
📊 MMMU 35.3, improved OCR/chart reading vs LLaVA-1.5; dynamic hi-res tiling up to 672x672
~6GB VRAM, runs on CPU
ollama
ollama run llava:7b-v1.6-mistral-q4_K_Sdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llava:7b-v1.6-mistral-q4_K_S
Hugging Face ↗
· ollama
SmolVLM 2.2B Instruct
4.5 GBHugging Face · 2.2B · Apache-2.0
📊 MMMU 42.0, DocVQA 80.0, TextVQA strong, Video-MME 52.1; best memory efficiency in class
~5GB VRAM FP16 (or ~2GB at Q4), runs on CPU
transformers
ollama run hf.co/HuggingFaceTB/SmolVLM-Instruct (GGUF community build) — or use transformersdocker
docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface huggingface/transformers-pytorch-gpu python -c "from transformers import AutoModelForImageTextToText; AutoModelForImageTextToText.from_pretrained('HuggingFaceTB/SmolVLM-Instruct')"
Hugging Face ↗
· transformers
Qwable-9B-Claude-Fable-5-GGUF
5.4 GBempero-ai · 9B · apache-2.0 · discovered
~7GB VRAM, or CPU with 9GB RAM
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model empero-ai/Qwable-9B-Claude-Fable-5-GGUF
Hugging Face ↗
· vllm
MiniCPM-V 2.6
5.5 GBOpenBMB (Tsinghua) · 8B · MiniCPM Model License (free commercial use with registration)
📊 OpenCompass ~65, MMMU ~49, DocVQA ~90, OCRBench ~85 (SOTA among small models); GPT-4V-level multi-image & video
~7GB VRAM, runs on CPU; designed to run on phones
ollama
ollama run minicpm-vdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run minicpm-v
Hugging Face ↗
· ollama
InternVL3 8B Instruct
5.5 GBOpenGVLab (Shanghai AI Lab) · 8B · MIT (LLM component: Qwen2.5 license)
📊 MMMU ~73, DocVQA 92.7, ChartQA/InfoVQA strong, OCRBench ~88; among best open 8B VLMs
~8GB VRAM at Q4 (fits 12GB GPU); ~18GB FP16
vllm
ollama run hf.co/mradermacher/InternVL3-8B-GGUF:Q4_K_M (community) — or use lmdeploy/vLLMdocker
docker run --runtime nvidia --gpus all -p 8000:8000 vllm/vllm-openai:latest --model OpenGVLab/InternVL3-8B --trust-remote-code
Hugging Face ↗
· vllm
Qwen2.5-VL 7B Instruct
6 GBAlibaba Qwen · 7B · Apache-2.0
📊 MMMU 58.6, DocVQA 95.7, ChartQA ~87, OCRBench ~86; beats Llama 3.2 11B Vision on most VQA
~7GB VRAM (fits 12GB GPU), runs on CPU
ollama
ollama run qwen2.5vl:7bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5vl:7b
Hugging Face ↗
· ollama
Unlimited-OCR
6 GBbaidu · mit · discovered
~8GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model baidu/Unlimited-OCR
Hugging Face ↗
· vllm
MiniMax-M3
6 GBMiniMaxAI · other · discovered
~8GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model MiniMaxAI/MiniMax-M3
Hugging Face ↗
· vllm
Kimi-K2.7-Code
6 GBmoonshotai · other · discovered
~8GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model moonshotai/Kimi-K2.7-Code
Hugging Face ↗
· vllm
lift
6 GBdatalab-to · openrail · discovered
~8GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model datalab-to/lift
Hugging Face ↗
· vllm
GLM-OCR
6 GBzai-org · mit · discovered
~8GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model zai-org/GLM-OCR
Hugging Face ↗
· vllm
Qwen3-VL 8B Instruct
6.1 GBAlibaba Qwen · 8B · Apache-2.0
📊 MMMU ~69, DocVQA ~95, MathVista strong; Qwen3-VL family scores up to MMMU 80.6 at largest sizes
~8GB VRAM (fits 12GB GPU), runs on CPU slowly
ollama
ollama run qwen3-vl:8bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:8b
Hugging Face ↗
· ollama
gemma-4-12b-it-GGUF
7.2 GBunsloth · 12B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/gemma-4-12b-it-GGUF
Hugging Face ↗
· vllm
Llama 3.2 Vision 11B Instruct
7.8 GBMeta · 11B · Llama 3.2 Community License (gated; <700M MAU)
📊 MMMU 50.7, DocVQA 88.4, ChartQA ~83, AI2D ~91, VQAv2 ~75
~9GB VRAM (fits 12-16GB GPU), runs on CPU
ollama
ollama run llama3.2-vision:11bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llama3.2-vision:11b
Hugging Face ↗
· ollama
LLaVA 1.6 (NeXT) 13B Vicuna
8 GBLiu et al. / LLaVA team · 13B · LLaMA-2 Community License (Vicuna base) + Apache (LLaVA weights)
📊 MMMU ~36, MMBench ~70, better text-in-image reading than 1.5
~10GB VRAM (fits 12-16GB GPU), runs on CPU
ollama
ollama run llava:13b-v1.6-vicuna-q4_K_Sdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llava:13b-v1.6-vicuna-q4_K_S
Hugging Face ↗
· ollama
Pixtral 12B (2409)
8 GBMistral AI · 12B · Apache-2.0
📊 MMMU 52.5 (CoT), DocVQA 90.7 (ANLS), ChartQA ~82, VQAv2 ~78; strong multi-image
~9GB VRAM at Q4 (fits 12-16GB GPU); ~24GB FP16
vllm
ollama run hf.co/mradermacher/Pixtral-12B-2409-GGUF:Q4_K_M (community GGUF) — vLLM recommendeddocker
docker run --runtime nvidia --gpus all -p 8000:8000 vllm/vllm-openai:latest --model mistralai/Pixtral-12B-2409 --tokenizer-mode mistral --limit-mm-per-prompt 'image=4'
Hugging Face ↗
· vllm
Gemma 3 12B (vision)
8.1 GBGoogle DeepMind · 12B · Gemma Terms of Use
📊 MMMU 50.3, DocVQA 82.3, InfoVQA/ChartQA strong; 128K context
~9GB VRAM (fits 12-16GB GPU), runs on CPU
ollama
ollama run gemma3:12bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:12b
Hugging Face ↗
· ollama
diffusiongemma-26B-A4B-it
15.6 GBgoogle · 26B · apache-2.0 · discovered
~18GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model google/diffusiongemma-26B-A4B-it
Hugging Face ↗
· vllm
gemma-4-26B-A4B-it-qat-GGUF
15.6 GBunsloth · 26B · apache-2.0 · discovered
~18GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/gemma-4-26B-A4B-it-qat-GGUF
Hugging Face ↗
· vllm
gemma-4-26B-A4B-it-GGUF
15.6 GBunsloth · 26B · apache-2.0 · discovered
~18GB VRAM (RTX 3090/4090)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/gemma-4-26B-A4B-it-GGUF
Hugging Face ↗
· vllm
Qwopus3.6-27B-Coder-Compat-MTP-GGUF
16.2 GBJackrong · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Jackrong/Qwopus3.6-27B-Coder-Compat-MTP-GGUF
Hugging Face ↗
· vllm
Qwopus3.6-27B-Coder-MTP-GGUF
16.2 GBJackrong · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF
Hugging Face ↗
· vllm
Qwen3.6-27B-MTP-GGUF
16.2 GBunsloth · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/Qwen3.6-27B-MTP-GGUF
Hugging Face ↗
· vllm
Qwen3.6-27B-MTP-pi-reasoning-GGUF
16.2 GBbytkim · 27B · apache-2.0 · discovered
~20GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF
Hugging Face ↗
· vllm
Gemma 3 27B (vision)
17 GBGoogle DeepMind · 27B · Gemma Terms of Use
📊 MMMU 56.1, DocVQA 85.6, ChartQA/AI2D strong; competitive with much larger models
~18GB VRAM (fits 24GB GPU), CPU possible
ollama
ollama run gemma3:27bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run gemma3:27b
Hugging Face ↗
· ollama
gemma-4-31B-it
18.6 GBgoogle · 31B · apache-2.0 · discovered
~22GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model google/gemma-4-31B-it
Hugging Face ↗
· vllm
Qwen3-VL 30B-A3B Instruct (MoE)
20 GBAlibaba Qwen · 30B · Apache-2.0
📊 MMMU ~73-75, DocVQA ~96; MoE with only ~3B active params so runs fast
~22GB VRAM (fits 24GB GPU at Q4); MoE keeps it fast on CPU
ollama
ollama run qwen3-vl:30b-a3bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen3-vl:30b-a3b
Hugging Face ↗
· ollama
LLaVA 1.6 (NeXT) 34B (Yi)
20 GBLiu et al. / LLaVA team · 34B · Yi License (Apache-like, free commercial with registration)
📊 MMMU ~46, MMBench ~79; strongest LLaVA-NeXT tier
~22GB VRAM at Q4 (fits 24GB GPU)
ollama
ollama run llava:34b-v1.6-q4_K_Sdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run llava:34b-v1.6-q4_K_S
Hugging Face ↗
· ollama
Qwen2.5-VL 32B Instruct
21 GBAlibaba Qwen · 32B · Apache-2.0
📊 MMMU ~70, DocVQA ~94, MathVista strong; near 72B quality
~23GB VRAM at Q4 (fits 24GB GPU); CPU possible but slow
ollama
ollama run qwen2.5vl:32bdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run qwen2.5vl:32b
Hugging Face ↗
· ollama
Qwen3.6-35B-A3B
21 GBQwen · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Qwen/Qwen3.6-35B-A3B
Hugging Face ↗
· vllm
Qwen3.6-35B-A3B-MTP-GGUF
21 GBunsloth · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model unsloth/Qwen3.6-35B-A3B-MTP-GGUF
Hugging Face ↗
· vllm
Qwen3.6-35B-A3B-StyleTune
21 GBGryphe · 35B · apache-2.0 · discovered
~24GB VRAM (24GB GPU)
docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model Gryphe/Qwen3.6-35B-A3B-StyleTune
Hugging Face ↗
· vllm
🎙 Speech-to-text 45
Moonshine Tiny
0.19 GBUseful Sensors / Moonshine AI · 0.027B · MIT
📊 27M params; better average WER than Whisper tiny.en while ~5x faster. v2 Tiny hits ~50ms latency (5.8x faster than Whisper Tiny). English. Variable-length input (no fixed 30s padding) = big edge speedup.
<0.5GB; CPU/edge-first (designed for memory-constrained microcontrollers/SBCs)
transformers
pip install useful-moonshine; python -c "import moonshine; print(moonshine.transcribe('audio.wav','moonshine/tiny'))" # ONNX: moonshine.transcribe_with_onnxdocker
docker run -it -v $(pwd):/data python:3.11 bash -c "pip install useful-moonshine && python -c \"import moonshine; print(moonshine.transcribe('/data/audio.wav','moonshine/tiny'))\""
Hugging Face ↗
· transformers
Moonshine Base
0.237 GBUseful Sensors / Moonshine AI · 0.061B · MIT
📊 61M params, 237MB on disk; beats Whisper base.en on average WER while running much faster on CPU. English. Variable-length encoder avoids Whisper's fixed-window overhead.
<1GB; CPU/edge-first
transformers
pip install useful-moonshine; python -c "import moonshine; print(moonshine.transcribe('audio.wav','moonshine/base'))" # ONNX: moonshine.transcribe_with_onnxdocker
docker run -it -v $(pwd):/data python:3.11 bash -c "pip install useful-moonshine && python -c \"import moonshine; print(moonshine.transcribe('/data/audio.wav','moonshine/base'))\""
Hugging Face ↗
· transformers
NVIDIA Parakeet TDT-CTC 110M
0.46 GBNVIDIA · 0.11B · CC-BY-4.0
📊 Compact FastConformer hybrid TDT+CTC; competitive English WER for its size, very high RTFx. Good edge/streaming candidate.
~1GB VRAM; can run CPU
transformers
pip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m'); print(m.transcribe(['audio.wav'])[0].text)"docker
docker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗
· transformers
nemotron-3.5-asr-streaming-0.6b
0.6 GBnvidia · 0.6B · other · discovered
runs on CPU / any laptop
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: nvidia/nemotron-3.5-asr-streaming-0.6b
Hugging Face ↗
· faster-whisper
ark-asr-0.6b-int8-onnx
0.6 GBAutoArk-AI · 0.6B · apache-2.0 · discovered
runs on CPU / any laptop
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: AutoArk-AI/ark-asr-0.6b-int8-onnx
Hugging Face ↗
· faster-whisper
nemotron-speech-streaming-en-0.6b
0.6 GBnvidia · 0.6B · other · discovered
runs on CPU / any laptop
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: nvidia/nemotron-speech-streaming-en-0.6b
Hugging Face ↗
· faster-whisper
Qwen3-ASR-0.6B
0.6 GBQwen · 0.6B · apache-2.0 · discovered
runs on CPU / any laptop
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: Qwen/Qwen3-ASR-0.6B
Hugging Face ↗
· faster-whisper
NVIDIA Canary-180m-Flash
0.73 GBNVIDIA · 0.182B · CC-BY-4.0
📊 >1200 RTFx (extremely fast); 4 languages (en/de/fr/es) ASR + translation. Strong accuracy-per-param for a 182M model. Word-level timestamps.
~1-2GB VRAM; can run CPU
transformers
pip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-180m-flash'); print(m.transcribe(['audio.wav'])[0].text)"docker
docker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-180m-flash'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗
· transformers
SenseVoice-Small
0.94 GBFunAudioLLM (Alibaba) · 0.234B · Apache-2.0 (model: 'model-license', code Apache-2.0)
📊 Non-autoregressive; >5x faster than Whisper-Small and ~15x faster than Whisper-Large; latency <80ms. Beats Whisper on Chinese/Cantonese benchmarks (e.g. AISHELL-1). 50+ languages incl. zh/en/yue/ja/ko, plus emotion (SER) + audio-event detection (AED) + ITN.
~1-2GB VRAM; runs well on CPU
transformers
pip install funasr; python -c "from funasr import AutoModel; m=AutoModel(model='FunAudioLLM/SenseVoiceSmall',hub='hf'); print(m.generate(input='audio.mp3',language='auto',use_itn=True)[0]['text'])"docker
docker run --gpus all -it -v $(pwd):/data registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:latest-cuda python -c "from funasr import AutoModel; m=AutoModel(model='FunAudioLLM/SenseVoiceSmall',hub='hf'); print(m.generate(input='/data/audio.mp3',language='auto',use_itn=True)[0]['text'])"
Hugging Face ↗
· transformers
speaker-diarization-3.1
1 GBpyannote · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: pyannote/speaker-diarization-3.1
Hugging Face ↗
· faster-whisper
speaker-diarization-community-1
1 GBpyannote · cc-by-4.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: pyannote/speaker-diarization-community-1
Hugging Face ↗
· faster-whisper
cohere-transcribe-03-2026
1 GBCohereLabs · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: CohereLabs/cohere-transcribe-03-2026
Hugging Face ↗
· faster-whisper
whisper.cpp
1 GBggerganov · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: ggerganov/whisper.cpp
Hugging Face ↗
· faster-whisper
VibeVoice-ASR
1 GBmicrosoft · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: microsoft/VibeVoice-ASR
Hugging Face ↗
· faster-whisper
medasr
1 GBgoogle · other · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: google/medasr
Hugging Face ↗
· faster-whisper
GLM-ASR-Nano-2512
1 GBzai-org · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: zai-org/GLM-ASR-Nano-2512
Hugging Face ↗
· faster-whisper
Fun-ASR-Nano-2512
1 GBFunAudioLLM · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: FunAudioLLM/Fun-ASR-Nano-2512
Hugging Face ↗
· faster-whisper
parakeet-cpp-gguf
1 GBmudler · cc-by-4.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: mudler/parakeet-cpp-gguf
Hugging Face ↗
· faster-whisper
GigaAM-v3
1 GBai-sage · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: ai-sage/GigaAM-v3
Hugging Face ↗
· faster-whisper
fastconformer-quran-ar
1 GBmohammed · cc-by-4.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: mohammed/fastconformer-quran-ar
Hugging Face ↗
· faster-whisper
whisper-hinglish-preview
1 GBTrelis · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: Trelis/whisper-hinglish-preview
Hugging Face ↗
· faster-whisper
kotoba-whisper-v2.2
1 GBkotoba-tech · apache-2.0 · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: kotoba-tech/kotoba-whisper-v2.2
Hugging Face ↗
· faster-whisper
anime-whisper
1 GBlitagin · mit · discovered
~2GB VRAM, or CPU with 2GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: litagin/anime-whisper
Hugging Face ↗
· faster-whisper
wav2vec2-XLS-R 300M (multilingual base)
1.2 GBMeta (Facebook AI) · 0.3B · Apache-2.0
📊 Pretrained on 436k hrs across 128 languages. Not directly an ASR head — needs fine-tuning (CTC) per language; fine-tuned variants reach competitive multilingual WER (e.g. Common Voice). Foundation for many community ASR models.
~1-2GB VRAM; CPU works for inference
transformers
# pretrained-only: fine-tune then run. pip install transformers torch; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','<your-finetuned-xls-r-300m>'); print(p('audio.wav')['text'])"docker
docker run --gpus all -it -v $(pwd):/data huggingface/transformers-pytorch-gpu python -c "from transformers import Wav2Vec2Model; m=Wav2Vec2Model.from_pretrained('facebook/wav2vec2-xls-r-300m'); print('loaded')"
Hugging Face ↗
· transformers
wav2vec2 large-960h-lv60-self (English)
1.26 GBMeta (Facebook AI) · 0.317B · Apache-2.0
📊 1.8% / 3.3% WER on LibriSpeech test-clean / test-other (CTC, self-training on 960h + 53k unlabeled). English-only, no built-in punctuation. With 10 min labeled data still ~4.8/8.2 WER.
~1-2GB VRAM; runs on CPU
transformers
pip install transformers torch torchaudio; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','facebook/wav2vec2-large-960h-lv60-self'); print(p('audio.wav')['text'])"docker
docker run --gpus all -it -v $(pwd):/data huggingface/transformers-pytorch-gpu python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','facebook/wav2vec2-large-960h-lv60-self'); print(p('/data/audio.wav')['text'])"
Hugging Face ↗
· transformers
Whisper large-v3-turbo
1.5 GBOpenAI · 0.809B · MIT
📊 ~3-4% WER LibriSpeech test-clean; only 0.3-0.7pt WER worse than large-v2 but ~6-8x faster (4 decoder layers vs 32). 99-language multilingual.
~4-6GB VRAM; q5_0 fits ~2GB; usable on CPU
whisper.cpp
./download-ggml-model.sh large-v3-turbo && ./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wavdocker
docker run -it -v $(pwd):/audio ghcr.io/ggml-org/whisper.cpp:main "./build/bin/whisper-cli -m /models/ggml-large-v3-turbo.bin -f /audio/audio.wav"
Hugging Face ↗
· whisper.cpp
Distil-Whisper distil-large-v3
1.5 GBHugging Face · 0.756B · MIT
📊 Within 1.5% WER of large-v3 on OOD short-form, within 1% on long-form, +0.1% better on chunked long-form. ~6x faster than large-v3. English-only.
~2-3GB VRAM FP16; CPU usable via CT2/GGML
transformers
pip install transformers torch; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','distil-whisper/distil-large-v3',torch_dtype='float16',device='cuda'); print(p('audio.wav')['text'])"docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # set model=Systran/faster-distil-whisper-large-v3 (CTranslate2 build)
Hugging Face ↗
· transformers
Distil-Whisper distil-large-v3.5
1.5 GBHugging Face · 0.756B · MIT
📊 Short-form 7.10 WER vs large-v3's 7.14 (slightly better); long-form 10.04 vs 8.82 (a bit worse). ~1.5x faster than large-v3-turbo on long-form. Trained on 98k hrs with patient teacher + SpecAugment.
~2-3GB VRAM FP16; CPU via CT2/ONNX builds
transformers
pip install transformers torch; python -c "from transformers import pipeline; p=pipeline('automatic-speech-recognition','distil-whisper/distil-large-v3.5',torch_dtype='float16',device='cuda'); print(p('audio.wav')['text'])"docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # set model=distil-whisper/distil-large-v3.5-ct2
Hugging Face ↗
· transformers
whisper-large-v3
1.6 GBopenai · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: openai/whisper-large-v3
Hugging Face ↗
· faster-whisper
whisper-large-v3-turbo
1.6 GBopenai · mit · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: openai/whisper-large-v3-turbo
Hugging Face ↗
· faster-whisper
kazakh-whisper-large-v3-turbo
1.6 GBshyngys879 · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: shyngys879/kazakh-whisper-large-v3-turbo
Hugging Face ↗
· faster-whisper
seamless-m4t-v2-large
1.6 GBfacebook · cc-by-nc-4.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: facebook/seamless-m4t-v2-large
Hugging Face ↗
· faster-whisper
faster-whisper large-v3-turbo (CTranslate2)
1.62 GBSYSTRAN / deepdml / OpenAI weights · 0.809B · MIT
📊 Near large-v3-turbo quality (~3-4% WER LibriSpeech clean) at very high throughput; combines turbo's pruned decoder with CTranslate2 speedups. Sub-second latency feasible.
~1.5-2GB VRAM FP16; int8 <1GB; good on CPU
faster-whisper
pip install faster-whisper; python -c "from faster_whisper import WhisperModel; m=WhisperModel('deepdml/faster-whisper-large-v3-turbo-ct2',device='cuda',compute_type='float16'); [print(s.text) for s,_ in [m.transcribe('audio.wav')][0]]"docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # set model=deepdml/faster-whisper-large-v3-turbo-ct2
Hugging Face ↗
· faster-whisper
Qwen3-ASR-1.7B
1.7 GBQwen · 1.7B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: Qwen/Qwen3-ASR-1.7B
Hugging Face ↗
· faster-whisper
Kyutai STT 1B (en/fr, streaming)
2 GBKyutai · 1B · CC-BY-4.0
📊 Streaming STT with ~0.5s delay + semantic VAD; English & French. Word-level timestamps; robust to noise. Built on Mimi codec + Moshi-style autoregressive decoder. Trained on 2.5M hrs.
~2-4GB VRAM; designed for real-time streaming on GPU
transformers
pip install moshi; python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr audio.wav # or use transformers KyutaiSpeechToTextdocker
docker run --gpus all -it -v $(pwd):/data python:3.11 bash -c "pip install moshi && python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr /data/audio.wav"
Hugging Face ↗
· transformers
granite-speech-4.1-2b
2 GBibm-granite · 2B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: ibm-granite/granite-speech-4.1-2b
Hugging Face ↗
· faster-whisper
NVIDIA Parakeet TDT 0.6B v2 (English)
2.4 GBNVIDIA · 0.6B · CC-BY-4.0
📊 Open ASR Leaderboard avg 6.05% WER (was #1 at release, May 2025). LibriSpeech test-clean 1.69%, test-other 3.19%. RTFx >3000 — transcribes ~1hr audio per second on GPU. English-only.
~2-4GB VRAM; needs NVIDIA GPU (CUDA) for best speed; CPU possible but slow
transformers
pip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v2'); print(m.transcribe(['audio.wav'])[0].text)"docker
docker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v2'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗
· transformers
NVIDIA Parakeet TDT 0.6B v3 (multilingual)
2.4 GBNVIDIA · 0.6B · CC-BY-4.0
📊 Open ASR Leaderboard avg 6.34% WER. LibriSpeech test-clean 1.93%. Multilingual Fleurs avg 11.97% WER across 25 European languages. RTFx >3000. Auto language detection.
~2-4GB VRAM; NVIDIA GPU recommended
transformers
pip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v3'); print(m.transcribe(['audio.wav'])[0].text)"docker
docker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v3'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗
· transformers
ARK-ASR-3B
3 GBAutoArk-AI · 3B · apache-2.0 · discovered
~4GB VRAM, or CPU with 5GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: AutoArk-AI/ARK-ASR-3B
Hugging Face ↗
· faster-whisper
faster-whisper large-v3 (CTranslate2)
3.09 GBSYSTRAN / OpenAI weights · 1.54B · MIT
📊 Same accuracy as Whisper large-v3 (~1.8-2.7% WER LibriSpeech clean) but up to 4x faster and lower memory via CTranslate2. int8 quant adds speed with minimal WER loss.
~3GB VRAM FP16; int8 ~1.5-2GB; strong CPU performance
faster-whisper
pip install faster-whisper; python -c "from faster_whisper import WhisperModel; m=WhisperModel('large-v3',device='cuda',compute_type='float16'); [print(s.text) for s,_ in [m.transcribe('audio.wav')][0]]"docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # OpenAI-compatible /v1/audio/transcriptions, set model=Systran/faster-whisper-large-v3
Hugging Face ↗
· faster-whisper
NVIDIA Canary-1B-Flash
3.5 GBNVIDIA · 0.883B · CC-BY-4.0
📊 Avg WER ~6.67% on Open ASR Leaderboard; >1000 RTFx (much faster than original Canary-1B). 4 languages (en/de/fr/es) ASR + En<->X translation with optional punctuation/capitalization.
~3-5GB VRAM; NVIDIA GPU recommended
transformers
pip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-flash'); print(m.transcribe(['audio.wav'])[0].text)"docker
docker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-flash'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗
· transformers
NVIDIA Canary-1B-v2 (multilingual ASR+AST)
4 GBNVIDIA · 0.978B · CC-BY-4.0
📊 Top-tier on Open ASR Leaderboard (~5.6-6.7% WER region); 25 European languages, ASR + speech translation (X<->En). Encoder-decoder FastConformer + Transformer. Word-level timestamps.
~4-6GB VRAM; NVIDIA GPU recommended
transformers
pip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-v2'); print(m.transcribe(['audio.wav'])[0].text)"docker
docker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/canary-1b-v2'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗
· transformers
Voxtral-Mini-4B-Realtime-2602
4 GBmistralai · 4B · apache-2.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
docker
docker run --gpus all -p 8000:8000 fedirz/faster-whisper-server:latest-cuda # model: mistralai/Voxtral-Mini-4B-Realtime-2602
Hugging Face ↗
· faster-whisper
NVIDIA Parakeet TDT 1.1B
4.5 GBNVIDIA · 1.1B · CC-BY-4.0
📊 ~6.0-6.5% avg WER region on Open ASR Leaderboard; trained on 64k+ hrs. Larger encoder than 0.6B for marginal accuracy gains. English.
~4-6GB VRAM; NVIDIA GPU recommended
transformers
pip install -U 'nemo_toolkit[asr]'; python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b'); print(m.transcribe(['audio.wav'])[0].text)"docker
docker run --gpus all -it -v $(pwd):/data nvcr.io/nvidia/nemo:24.12 python -c "import nemo.collections.asr as na; m=na.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b'); print(m.transcribe(['/data/audio.wav'])[0].text)"
Hugging Face ↗
· transformers
Kyutai STT 2.6B (English, high accuracy)
9 GBKyutai · 2.6B · CC-BY-4.0
📊 ~6.4% WER; English-only, optimized for max accuracy with a 2.5s delay. Robust in noisy conditions and on audio up to ~2 hours. A H100 can serve ~400 streams in real-time.
~6-9GB VRAM (bf16); GPU recommended; MLX build runs on Apple Silicon
transformers
pip install moshi; python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio.wav # or transformers KyutaiSpeechToTextdocker
docker run --gpus all -it -v $(pwd):/data python:3.11 bash -c "pip install moshi && python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en /data/audio.wav"
Hugging Face ↗
· transformers
🔊 Text-to-speech 38
Kitten-TTS Nano 0.1 (int8)
0.025 GBKittenML · 0.015B · Apache-2.0
📊 No published MOS; positioned as 'SoTA under 25MB'. 24kHz output, 8 voices. Real-time on CPU including phones/Raspberry Pi.
0GB VRAM, CPU-only (runs on phones / <1GB RAM)
transformers
pip install kittentts soundfile && python -c "from kittentts import KittenTTS; import soundfile as sf; m=KittenTTS('KittenML/kitten-tts-nano-0.1'); sf.write('out.wav', m.generate('Hello world', voice='expr-voice-2-f'), 24000)"docker
docker run -d -p 8000:8000 ghcr.io/devnen/kitten-tts-server:latest # devnen/Kitten-TTS-Server, Web UI + OpenAI-compatible API
Hugging Face ↗
· transformers
Piper (e.g. en_US-lessac-medium)
0.06 GBRhasspy / Open Home Foundation · 0.015B · MIT
📊 No formal MOS; VITS-based; ~10x real-time on desktop CPU, real-time on Raspberry Pi 5. Medium voices 22.05kHz, high 22.05kHz.
0GB VRAM, CPU-only by design; tiny RAM footprint
piper
pip install piper-tts && echo 'Hello world' | piper -m en_US-lessac-medium.onnx -f out.wav # download voices from https://huggingface.co/rhasspy/piper-voicesdocker
docker run --rm -v $PWD:/data -e PIPER_VOICE=en_US-lessac-medium lscr.io/linuxserver/piper:latest # or rhasspy/wyoming-piper
Hugging Face ↗
· piper
MeloTTS (English v3)
0.21 GBMyShell.ai + MIT · 0.05B · MIT
📊 VITS-based; CPU real-time capable. No formal MOS published but widely used; clear, natural multilingual speech. ~44.1kHz internal.
~1GB VRAM; fast CPU real-time inference
transformers
pip install git+https://github.com/myshell-ai/MeloTTS.git && python -m unidic download && python -c "from melo.api import TTS; t=TTS(language='EN', device='cpu'); t.tts_to_file('Hello world', t.hps.data.spk2id['EN-US'], 'out.wav')"docker
docker run -d -p 8888:8888 --gpus all ghcr.io/myshell-ai/melotts:latest # official MeloTTS image with web UI
Hugging Face ↗
· transformers
Kokoro-82M (v1.0)
0.33 GBhexgrad · 0.082B · Apache-2.0
📊 Was #1 in TTS Spaces Arena (Dec 2024) at only 82M params, beating much larger models on naturalness ELO. ~24kHz. Sub-real-time on CPU, very fast on GPU.
~1GB VRAM; runs comfortably on CPU
transformers
pip install -q 'kokoro>=0.9.2' soundfile && python -c "from kokoro import KPipeline; import soundfile as sf; p=KPipeline(lang_code='a'); g=p('Hello world', voice='af_heart'); [sf.write(f'{i}.wav', a, 24000) for i,(_,_,a) in enumerate(g)]"docker
docker run -d -p 8880:8880 ghcr.io/remsky/kokoro-fastapi:latest # remsky/Kokoro-FastAPI, OpenAI-compatible /v1/audio/speech
Hugging Face ↗
· transformers
OpenVoice V2
0.4 GBMyShell.ai + MIT · 0.1B · MIT
📊 Tone-color conversion step <100ms; instant zero-shot voice cloning. Quality inherits from MeloTTS base. No single MOS, but strong cross-lingual cloning fidelity.
~2GB VRAM; runs on CPU
transformers
pip install git+https://github.com/myshell-ai/OpenVoice.git && python -c "from openvoice.api import ToneColorConverter" # MeloTTS base + tone-color converter; clone from ~6s referencedocker
docker run -d -p 8000:8000 --gpus all ghcr.io/myshell-ai/openvoice:v2 # or any python:3.10 image with the repo installed
Hugging Face ↗
· transformers
Inflect-Nano-v1
0.5 GBowensong · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load owensong/Inflect-Nano-v1 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
MOSS-TTS-Local-Transformer-v1.5
0.5 GBOpenMOSS-Team · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
OmniVoice
0.5 GBk2-fsa · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load k2-fsa/OmniVoice (see model card for TTS pipeline)
Hugging Face ↗
· transformers
ZONOS2
0.5 GBZyphra · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load Zyphra/ZONOS2 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
VoxCPM2
0.5 GBopenbmb · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load openbmb/VoxCPM2 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
dots.tts-soar
0.5 GBrednote-hilab · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load rednote-hilab/dots.tts-soar (see model card for TTS pipeline)
Hugging Face ↗
· transformers
supertonic-3
0.5 GBSupertone · openrail · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load Supertone/supertonic-3 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
s2-pro
0.5 GBfishaudio · other · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load fishaudio/s2-pro (see model card for TTS pipeline)
Hugging Face ↗
· transformers
GPA-v1.5
0.5 GBAutoArk-AI · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load AutoArk-AI/GPA-v1.5 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Fun-CosyVoice3-0.5B-2512
0.5 GBFunAudioLLM · 0.5B · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load FunAudioLLM/Fun-CosyVoice3-0.5B-2512 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
GPA
0.5 GBAutoArk-AI · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load AutoArk-AI/GPA (see model card for TTS pipeline)
Hugging Face ↗
· transformers
GPA-v1.5-onnx-runtime
0.5 GBAutoArk-AI · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load AutoArk-AI/GPA-v1.5-onnx-runtime (see model card for TTS pipeline)
Hugging Face ↗
· transformers
MisoTTS
0.5 GBMisoLabs · other · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load MisoLabs/MisoTTS (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Dramabox
0.5 GBResembleAI · other · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load ResembleAI/Dramabox (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Kokoro-Vietnamese
0.5 GBcontextboxai · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load contextboxai/Kokoro-Vietnamese (see model card for TTS pipeline)
Hugging Face ↗
· transformers
MOSS-TTS-v1.5
0.5 GBOpenMOSS-Team · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load OpenMOSS-Team/MOSS-TTS-v1.5 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
VoiceTut-TTS
0.5 GBmohammedaly22 · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load mohammedaly22/VoiceTut-TTS (see model card for TTS pipeline)
Hugging Face ↗
· transformers
BlueMagpie-TTS
0.5 GBOpenFormosa · other · discovered
runs on CPU / any laptop
docker
pip install transformers torch # load OpenFormosa/BlueMagpie-TTS (see model card for TTS pipeline)
Hugging Face ↗
· transformers
StyleTTS 2 (LibriTTS)
0.78 GBYinghao Aaron Li (Columbia) / yl4579 · 0.2B · MIT
📊 LJSpeech MOS-N 4.55 vs 4.23 ground-truth (surpasses human recordings single-speaker); matches human on multispeaker VCTK. 24kHz. Diffusion-style prosody.
~2-4GB VRAM; CPU usable but slow
transformers
pip install styletts2 && python -c "from styletts2 import tts; t=tts.StyleTTS2(); t.inference('Hello world', output_wav_file='out.wav')" # LJSpeech checkpoint: yl4579/StyleTTS2-LJSpeechdocker
docker run --rm -v $PWD:/work --gpus all python:3.10 bash -lc 'pip install styletts2 && python -c "from styletts2 import tts; tts.StyleTTS2().inference(\"Hi\", output_wav_file=\"/work/out.wav\")"'
Hugging Face ↗
· transformers
F5-TTS (v1 Base)
1.35 GBSWivid (Shanghai Jiao Tong Univ.) · 0.336B · CC-BY-NC-4.0 (weights) / Apache-2.0 for OpenF5-TTS
📊 Flow-matching (non-autoregressive, no diffusion) -> fast inference + strong prosody. ~0.15-0.3 RTF on GPU. Excellent zero-shot cloning + code-switching; among the top open cloning models of 2024-25. 24kHz.
~2-4GB VRAM; CPU usable
transformers
pip install f5-tts && f5-tts_infer-cli --model F5TTS_v1_Base --ref_audio ref.wav --ref_text 'reference transcript' --gen_text 'Hello world' # or: f5-tts_infer-gradio for web UIdocker
docker run -d -p 7860:7860 --gpus all ghcr.io/swivid/f5-tts:main # official image, launches Gradio UI
Hugging Face ↗
· transformers
VibeVoice-1.5B
1.5 GBmicrosoft · 1.5B · mit · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
pip install transformers torch # load microsoft/VibeVoice-1.5B (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Qwen3-TTS-12Hz-1.7B-CustomVoice
1.7 GBQwen · 1.7B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
pip install transformers torch # load Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice (see model card for TTS pipeline)
Hugging Face ↗
· transformers
IndexTTS-2
1.8 GBIndexTeam · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
pip install transformers torch # load IndexTeam/IndexTTS-2 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Chatterbox (Multilingual v3)
2 GBResemble AI · 0.5B · MIT
📊 Vendor blind test: 65.3% preferred Chatterbox-Turbo vs 24.5% ElevenLabs (take with salt). Sub-200ms latency (Turbo ~472ms first chunk, RTF ~0.5). First open model with emotion-exaggeration control. Trained on 500K hrs.
~4-6GB VRAM (0.5B); Turbo 350M lighter; CPU supported
transformers
pip install chatterbox-tts && python -c "import torchaudio as ta; from chatterbox.tts import ChatterboxTTS; m=ChatterboxTTS.from_pretrained(device='cuda'); ta.save('out.wav', m.generate('Hello world', audio_prompt_path='ref.wav'), m.sr)"docker
docker run -d -p 8004:8004 --gpus all ghcr.io/devnen/chatterbox-tts-server:latest # devnen/Chatterbox-TTS-Server: Web UI + OpenAI-compatible API, CUDA/ROCm/CPU
Hugging Face ↗
· transformers
Sesame CSM-1B
2.1 GBSesame AI Labs · 1B · Apache-2.0
📊 Conversational/contextual prosody (uses prior turns of text+audio). Llama backbone + Mimi RVQ decoder. ~200ms-class streaming. Strong context-aware naturalness; no single MOS published.
~4-6GB VRAM (bf16); GGUF runs smaller / CPU
transformers
pip install transformers torch soundfile && python -c "from transformers import CsmForConditionalGeneration, AutoProcessor; import torch, soundfile as sf; m=CsmForConditionalGeneration.from_pretrained('sesame/csm-1b'); p=AutoProcessor.from_pretrained('sesame/csm-1b')" # gated: huggingface-cli login firstdocker
docker run --rm --gpus all -v $PWD:/work huggingface/transformers-pytorch-gpu:latest python /work/csm_infer.py # GGUF: ggml-org/sesame-csm-1b-GGUF via llama.cpp
Hugging Face ↗
· transformers
Coqui XTTS-v2
2.1 GBCoqui (community-maintained) · 0.5B · Coqui Public Model License (CPML)
📊 6-second zero-shot voice cloning, 17 languages, cross-lingual + emotion/style transfer, 24kHz. Long the community favorite for quality cloning; ~150-200ms streaming latency on GPU.
~2-3GB VRAM (FP16); CPU works but slow
transformers
pip install coqui-tts && python -c "from TTS.api import TTS; TTS('tts_models/multilingual/multi-dataset/xtts_v2').tts_to_file(text='Hello world', speaker_wav='ref.wav', language='en', file_path='out.wav')" # coqui-tts is the maintained fork of the TTS packagedocker
docker run -d -p 8020:8020 --gpus all ghcr.io/coqui-ai/xtts-streaming-server:latest # official XTTS streaming server, OpenAI-ish API
Hugging Face ↗
· transformers
Orpheus-TTS 3B (finetuned)
2.3 GBCanopy Labs · 3B · Apache-2.0
📊 Llama-3.2-3B Speech-LLM, trained 100K+ hrs English. ~200ms streaming latency (down to ~100ms with input streaming). Zero-shot cloning + inline emotion tags (<laugh>,<sigh>,<gasp>...). Claims to rival/surpass closed-source naturalness.
Q4 ~3-4GB VRAM; bf16 ~8GB; CPU via GGUF/llama.cpp
ollama
ollama run hf.co/isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF # or library mirror: ollama run legraphista/Orpheus:3b-ft-q4_k_m (Ollama emits SNAC audio tokens -> decode with orpheus-speech)docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF # full audio: vllm serve canopylabs/orpheus-3b-0.1-ft
Hugging Face ↗
· ollama
Parler-TTS Mini v1
2.5 GBHugging Face · 0.88B · Apache-2.0
📊 Trained on 45K hours. Natural, controllable speech; you steer gender/pitch/pace/reverb/emotion with a natural-language description prompt. No headline MOS but fully reproducible (data+code+weights open).
~4GB VRAM; CPU possible, slow
transformers
pip install git+https://github.com/huggingface/parler-tts.git && python -c "from parler_tts import ParlerTTSForConditionalGeneration as M; from transformers import AutoTokenizer; import soundfile as sf; m=M.from_pretrained('parler-tts/parler-tts-mini-v1'); t=AutoTokenizer.from_pretrained('parler-tts/parler-tts-mini-v1')" # describe voice via text promptdocker
docker run --rm --gpus all -v $PWD:/work huggingface/transformers-pytorch-gpu:latest bash -lc 'pip install git+https://github.com/huggingface/parler-tts.git && python /work/parler.py'
Hugging Face ↗
· transformers
higgs-audio-v3-tts-4b
4 GBbosonai · 4B · other · discovered
~5GB VRAM, or CPU with 6GB RAM
docker
pip install transformers torch # load bosonai/higgs-audio-v3-tts-4b (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Ming-omni-tts-16.8B-A3B
4 GBinclusionAI · 16.8B · apache-2.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
docker
pip install transformers torch # load inclusionAI/Ming-omni-tts-16.8B-A3B (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Voxtral-4B-TTS-2603
4 GBmistralai · 4B · cc-by-nc-4.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
docker
pip install transformers torch # load mistralai/Voxtral-4B-TTS-2603 (see model card for TTS pipeline)
Hugging Face ↗
· transformers
higgs-audio-v3-tts-4b-transformers
4 GBmultimodalart · 4B · other · discovered
~5GB VRAM, or CPU with 6GB RAM
docker
pip install transformers torch # load multimodalart/higgs-audio-v3-tts-4b-transformers (see model card for TTS pipeline)
Hugging Face ↗
· transformers
Dia-1.6B
6.4 GBNari Labs · 1.6B · Apache-2.0
📊 Specialized for ultra-realistic multi-speaker DIALOGUE in one pass; handles nonverbals (laughs, coughs, throat-clear). Real-time on enterprise GPUs (~40 tok/s on A4000). 44.1kHz. Audio-conditioned emotion/tone + voice cloning from <=10s clip.
~10GB VRAM full (fits 25GB easily); bf16/int8 lowers it
transformers
pip install git+https://github.com/nari-labs/dia.git && python -c "from dia.model import Dia; m=Dia.from_pretrained('nari-labs/Dia-1.6B'); import soundfile as sf; sf.write('out.wav', m.generate('[S1] Hello. [S2] Hi there! (laughs)'), 44100)" # also in HF Transformers (DiaForConditionalGeneration)docker
docker run --rm --gpus all -v $PWD:/work huggingface/transformers-pytorch-gpu:latest python /work/dia_infer.py # requires PyTorch 2.0+ / CUDA 12.6
Hugging Face ↗
· transformers
◇ Embeddings & rerank 59
snowflake-arctic-embed-xs (v1)
0.046 GBSnowflake · 0.022B · Apache-2.0
📊 Smallest Arctic; dim 384
<0.2GB VRAM, runs anywhere incl. CPU/edge
ollama
ollama run snowflake-arctic-embed:22mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:22m
Hugging Face ↗
· ollama
all-MiniLM-L6-v2
0.046 GBsentence-transformers (UKPLab) · 0.022B · Apache-2.0
📊 MTEB (English v1) ~56.3 avg; the classic fast/CPU baseline
<0.2GB VRAM, extremely fast on CPU
ollama
ollama run all-minilm:l6docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull all-minilm
Hugging Face ↗
· ollama
granite-embedding-30m-english
0.063 GBIBM · 0.03B · Apache-2.0
📊 Fast English retrieval, tiny footprint
<0.2GB VRAM, very fast on CPU
ollama
ollama run granite-embedding:30mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull granite-embedding:30m
Hugging Face ↗
· ollama
GTE-small
0.067 GBAlibaba-NLP (thenlper) · 0.033B · MIT
📊 MTEB (English v1) ~61.4 avg
<0.3GB VRAM, very fast on CPU
sentence-transformers
ollama run hf.co/ChristianAzinn/gte-small-ggufdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ChristianAzinn/gte-small-gguf
Hugging Face ↗
· sentence-transformers
e5-small-v2
0.067 GBMicrosoft (intfloat) · 0.033B · MIT
📊 MTEB (English v1) ~59.9 avg
<0.3GB VRAM, very fast on CPU
sentence-transformers
ollama run hf.co/yixuan-chia/e5-small-v2-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/e5-small-v2-GGUF
Hugging Face ↗
· sentence-transformers
snowflake-arctic-embed-s (v1)
0.067 GBSnowflake · 0.033B · Apache-2.0
📊 Compact English retrieval, dim 384
<0.3GB VRAM, very fast on CPU
ollama
ollama run snowflake-arctic-embed:33mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:33m
Hugging Face ↗
· ollama
BGE-small-en-v1.5
0.07 GBBAAI · 0.033B · MIT
📊 MTEB (English v1) ~62.2 avg — punches above its size
<0.3GB VRAM, very fast on CPU
sentence-transformers
ollama run hf.co/CompendiumLabs/bge-small-en-v1.5-ggufdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/CompendiumLabs/bge-small-en-v1.5-gguf
Hugging Face ↗
· sentence-transformers
paraphrase-multilingual-MiniLM-L12-v2
0.13 GBsentence-transformers · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
Hugging Face ↗
· sentence-transformers
all-MiniLM-L12-v2
0.13 GBsentence-transformers · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
Hugging Face ↗
· sentence-transformers
multi-modal-embed-small
0.13 GBllm-semantic-router · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("llm-semantic-router/multi-modal-embed-small")
Hugging Face ↗
· sentence-transformers
snowflake-arctic-embed-m (v1)
0.219 GBSnowflake · 0.11B · Apache-2.0
📊 MTEB retrieval ~54.9 nDCG@10 (English)
~0.5GB VRAM, fast on CPU
ollama
ollama run snowflake-arctic-embed:110mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:110m
Hugging Face ↗
· ollama
BGE-base-en-v1.5
0.22 GBBAAI · 0.109B · MIT
📊 MTEB (English v1) ~63.5 avg
~0.5GB VRAM, fast on CPU
sentence-transformers
ollama run hf.co/CompendiumLabs/bge-base-en-v1.5-ggufdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
Hugging Face ↗
· sentence-transformers
e5-base-v2
0.22 GBMicrosoft (intfloat) · 0.109B · MIT
📊 MTEB (English v1) ~61.5 avg
~0.5GB VRAM, fast on CPU
sentence-transformers
ollama run hf.co/yixuan-chia/e5-base-v2-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/e5-base-v2-GGUF
Hugging Face ↗
· sentence-transformers
all-mpnet-base-v2
0.22 GBsentence-transformers (UKPLab) · 0.109B · Apache-2.0
📊 MTEB (English v1) ~57.8 avg — long the best general-purpose ST model
~0.5GB VRAM, fast on CPU
sentence-transformers
ollama run hf.co/sentence-transformers/all-mpnet-base-v2docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/sentence-transformers/all-mpnet-base-v2
Hugging Face ↗
· sentence-transformers
multilingual-e5-small
0.24 GBMicrosoft (intfloat) · 0.118B · MIT
📊 Good multilingual quality for 118M params
<0.4GB VRAM, very fast on CPU
sentence-transformers
ollama run hf.co/yixuan-chia/multilingual-e5-small-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/multilingual-e5-small-GGUF
Hugging Face ↗
· sentence-transformers
nomic-embed-text-v1.5
0.274 GBNomic AI · 0.137B · Apache-2.0
📊 Beats OpenAI text-embedding-ada-002 & 3-small on short+long context; MTEB ~62
~0.5GB VRAM (522MB), runs on CPU
ollama
ollama run nomic-embed-text:v1.5docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull nomic-embed-text
Hugging Face ↗
· ollama
GTE-base-en-v1.5
0.28 GBAlibaba-NLP · 0.137B · Apache-2.0
📊 MTEB (English v1) ~64 avg
~0.6GB VRAM, fast on CPU
sentence-transformers
ollama run hf.co/ChristianAzinn/gte-base-en-v1.5-ggufdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ChristianAzinn/gte-base-en-v1.5-gguf
Hugging Face ↗
· sentence-transformers
granite-embedding-r2 (english, 149m)
0.3 GBIBM · 0.149B · Apache-2.0
📊 2025 R2 release; improved retrieval over r1, longer context
~0.5GB VRAM, fast on CPU
sentence-transformers
ollama run hf.co/ibm-granite/granite-embedding-english-r2docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ibm-granite/granite-embedding-english-r2
Hugging Face ↗
· sentence-transformers
LFM2.5-Embedding-350M
0.5 GBLiquidAI · other · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("LiquidAI/LFM2.5-Embedding-350M")
Hugging Face ↗
· sentence-transformers
LFM2.5-ColBERT-350M
0.5 GBLiquidAI · other · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("LiquidAI/LFM2.5-ColBERT-350M")
Hugging Face ↗
· sentence-transformers
LateOn-regularized
0.5 GBlightonai · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("lightonai/LateOn-regularized")
Hugging Face ↗
· sentence-transformers
ruri-v3-310m
0.5 GBcl-nagoya · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("cl-nagoya/ruri-v3-310m")
Hugging Face ↗
· sentence-transformers
GTE-ModernColBERT-v1
0.5 GBlightonai · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("lightonai/GTE-ModernColBERT-v1")
Hugging Face ↗
· sentence-transformers
nomic-embed-text-v1
0.5 GBnomic-ai · apache-2.0 · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("nomic-ai/nomic-embed-text-v1")
Hugging Face ↗
· sentence-transformers
LFM2-ColBERT-350M
0.5 GBLiquidAI · other · discovered
runs on CPU / any laptop
docker
pip install sentence-transformers # SentenceTransformer("LiquidAI/LFM2-ColBERT-350M")
Hugging Face ↗
· sentence-transformers
multilingual-e5-base
0.56 GBMicrosoft (intfloat) · 0.278B · MIT
📊 Solid multilingual MTEB, mid-size
~0.6GB VRAM, fast on CPU
sentence-transformers
ollama run hf.co/yixuan-chia/multilingual-e5-base-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/multilingual-e5-base-GGUF
Hugging Face ↗
· sentence-transformers
jina-reranker-v2-base-multilingual
0.56 GBJina AI · 0.278B · CC-BY-NC-4.0 (non-commercial)
📊 Fast multilingual cross-encoder; strong BEIR/MKQA; agentic function-calling rerank
~1GB VRAM, runs on CPU
sentence-transformers
ollama run hf.co/gpustack/jina-reranker-v2-base-multilingual-GGUFdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model jinaai/jina-reranker-v2-base-multilingual --task score
Hugging Face ↗
· sentence-transformers
granite-embedding-278m-multilingual
0.563 GBIBM · 0.278B · Apache-2.0
📊 Competitive multilingual retrieval; enterprise/clean-data trained
~0.7GB VRAM, fast on CPU
ollama
ollama run granite-embedding:278mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull granite-embedding:278m
Hugging Face ↗
· ollama
snowflake-arctic-embed-m-v2.0
0.61 GBSnowflake · 0.305B · Apache-2.0
📊 Strong multilingual retrieval, smaller footprint than L-v2.0
~0.8GB VRAM, fast on CPU
sentence-transformers
ollama run hf.co/Snowflake/snowflake-arctic-embed-m-v2.0docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/Snowflake/snowflake-arctic-embed-m-v2.0
Hugging Face ↗
· sentence-transformers
EmbeddingGemma-300m
0.62 GBGoogle DeepMind · 0.308B · Gemma Terms of Use
📊 Highest-ranked open multilingual embedder under 500M on MMTEB at release (Sep 2025)
~0.6GB VRAM, runs on CPU/mobile
ollama
ollama run embeddinggemmadocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull embeddinggemma
Hugging Face ↗
· ollama
Qwen3-Embedding-0.6B
0.64 GBAlibaba Qwen · 0.6B · Apache-2.0
📊 MTEB Multilingual mean 64.33; MTEB-Code strong; instruction-aware
~1GB VRAM, runs easily on CPU
ollama
ollama run dengcao/Qwen3-Embedding-0.6B:Q8_0docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run dengcao/Qwen3-Embedding-0.6B:Q8_0
Hugging Face ↗
· ollama
Qwen3-Reranker-0.6B
0.64 GBAlibaba Qwen · 0.6B · Apache-2.0
📊 Cross-encoder reranker; strong MTEB-R / MIRACL reranking gains; instruction-aware
~1GB VRAM, runs on CPU
transformers
ollama run hf.co/Mungert/Qwen3-Reranker-0.6B-GGUFdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model Qwen/Qwen3-Reranker-0.6B
Hugging Face ↗
· transformers
snowflake-arctic-embed-l (v1.5)
0.669 GBSnowflake · 0.335B · Apache-2.0
📊 MTEB retrieval ~55.9 nDCG@10 (English) at release (Apr 2024)
~1.5GB VRAM, runs on CPU
ollama
ollama run snowflake-arctic-embed:335mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run snowflake-arctic-embed:335m
Hugging Face ↗
· ollama
BGE-large-en-v1.5
0.67 GBBAAI · 0.335B · MIT
📊 MTEB (English v1) ~64.2 avg; long the default RAG baseline
~1.5GB VRAM, runs on CPU
ollama
ollama run znbang/bge:large-en-v1.5-f16docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run znbang/bge:large-en-v1.5-f16
Hugging Face ↗
· ollama
e5-large-v2
0.67 GBMicrosoft (intfloat) · 0.335B · MIT
📊 MTEB (English v1) ~62.3 avg
~1.5GB VRAM, runs on CPU
sentence-transformers
ollama run hf.co/yixuan-chia/e5-large-v2-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/e5-large-v2-GGUF
Hugging Face ↗
· sentence-transformers
mxbai-embed-large-v1
0.67 GBMixedbread AI · 0.335B · Apache-2.0
📊 MTEB (English v1) ~64.7 avg — SOTA for BERT-large size at release (Mar 2024), no MTEB-data overlap
~1.5GB VRAM, runs on CPU
ollama
ollama run mxbai-embed-large:v1docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull mxbai-embed-large
Hugging Face ↗
· ollama
GTE-large-en-v1.5
0.87 GBAlibaba-NLP · 0.434B · Apache-2.0
📊 MTEB (English v1) ~65 avg — SOTA in its size class at release
~1.5GB VRAM, runs on CPU
sentence-transformers
ollama run hf.co/ChristianAzinn/gte-large-en-v1.5-ggufdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/ChristianAzinn/gte-large-en-v1.5-gguf
Hugging Face ↗
· sentence-transformers
stella_en_400M_v5
0.87 GBNovaSearch (dunzhang) · 0.435B · MIT
📊 MTEB (English v1) ~70 avg — top small model; near 1.5B quality
~1.5GB VRAM, runs on CPU
sentence-transformers
ollama run hf.co/dunzhang/stella_en_400M_v5docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/dunzhang/stella_en_400M_v5
Hugging Face ↗
· sentence-transformers
nomic-embed-text-v2-moe
0.94 GBNomic AI · 0.475B · Apache-2.0
📊 Multilingual MoE; competitive multilingual MTEB at ~305M active params
~1GB VRAM, runs on CPU
ollama
ollama run nomic-embed-text-v2-moedocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull nomic-embed-text-v2-moe
Hugging Face ↗
· ollama
bge-reranker-base
1.1 GBBAAI · 0.278B · MIT
📊 XLM-RoBERTa-base cross-encoder; solid CN/EN reranking
~1GB VRAM, runs on CPU
sentence-transformers
ollama run hf.co/gpustack/bge-reranker-base-GGUFdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model BAAI/bge-reranker-base --task score
Hugging Face ↗
· sentence-transformers
multilingual-e5-large
1.1 GBMicrosoft (intfloat) · 0.56B · MIT
📊 Strong multilingual MTEB; beats BGE-large-en & Cohere multilingual-v3 at release
~1.5GB VRAM, runs on CPU
ollama
ollama run hf.co/yixuan-chia/multilingual-e5-large-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/yixuan-chia/multilingual-e5-large-GGUF
Hugging Face ↗
· ollama
jina-embeddings-v3
1.1 GBJina AI · 0.572B · CC-BY-NC-4.0 (non-commercial)
📊 Outperforms OpenAI text-embedding-3-large & Cohere on MTEB multilingual at release (Sep 2024)
~1.5GB VRAM, runs on CPU
sentence-transformers
ollama run hf.co/gpustack/jina-embeddings-v3-GGUFdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/gpustack/jina-embeddings-v3-GGUF
Hugging Face ↗
· sentence-transformers
BGE-M3
1.2 GBBAAI · 0.567B · MIT
📊 MIRACL nDCG@10 ~70 (multilingual SOTA at release); strong BEIR; hybrid dense+sparse+ColBERT
~2GB VRAM, runs on CPU
ollama
ollama run bge-m3:567m-fp16docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull bge-m3
Hugging Face ↗
· ollama
bge-reranker-v2-m3
1.2 GBBAAI · 0.568B · Apache-2.0
📊 Multilingual cross-encoder; strong MIRACL/BEIR reranking; lightweight
~2GB VRAM, runs on CPU
sentence-transformers
ollama run hf.co/gpustack/bge-reranker-v2-m3-GGUFdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model BAAI/bge-reranker-v2-m3 --task score
Hugging Face ↗
· sentence-transformers
snowflake-arctic-embed-l-v2.0
1.2 GBSnowflake · 0.568B · Apache-2.0
📊 Top BEIR nDCG@10 + strong CLEF/MIRACL multilingual at release (Dec 2024)
~1.5GB VRAM, runs on CPU
ollama
ollama run snowflake-arctic-embed2:568mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama pull snowflake-arctic-embed2
Hugging Face ↗
· ollama
MoAI-Embedding-0.6B
1.2 GBBCCard · 0.6B · apache-2.0 · discovered
~3GB VRAM, or CPU with 2GB RAM
docker
pip install sentence-transformers # SentenceTransformer("BCCard/MoAI-Embedding-0.6B")
Hugging Face ↗
· sentence-transformers
plamo-embedding-1b
2 GBpfnet · 1B · apache-2.0 · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
pip install sentence-transformers # SentenceTransformer("pfnet/plamo-embedding-1b")
Hugging Face ↗
· sentence-transformers
llama-nemotron-embed-vl-1b-v2
2 GBnvidia · 1B · other · discovered
~3GB VRAM, or CPU with 3GB RAM
docker
pip install sentence-transformers # SentenceTransformer("nvidia/llama-nemotron-embed-vl-1b-v2")
Hugging Face ↗
· sentence-transformers
Qwen3-Embedding-4B
2.5 GBAlibaba Qwen · 4B · Apache-2.0
📊 MTEB Multilingual mean 69.45; near-SOTA retrieval
~3-6GB VRAM depending on quant
ollama
ollama run dengcao/Qwen3-Embedding-4B:Q4_K_Mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run dengcao/Qwen3-Embedding-4B:Q4_K_M
Hugging Face ↗
· ollama
Qwen3-Reranker-4B
2.5 GBAlibaba Qwen · 4B · Apache-2.0
📊 SOTA-class open reranker; large gains on BEIR/MIRACL/MTEB-R reranking
~3-6GB VRAM
transformers
ollama run dengcao/Qwen3-Reranker-4B:Q4_K_Mdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model Qwen/Qwen3-Reranker-4B
Hugging Face ↗
· transformers
gte-Qwen2-1.5B-instruct
3.1 GBAlibaba-NLP · 1.5B · Apache-2.0
📊 MTEB ~67 avg; instruction-tuned LLM embedder
~2-4GB VRAM
sentence-transformers
ollama run rjmalagon/gte-qwen2-1.5b-instruct-embed-f16docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run rjmalagon/gte-qwen2-1.5b-instruct-embed-f16
Hugging Face ↗
· sentence-transformers
mxbai-rerank-large-v2
3.1 GBMixedbread AI · 1.5B · Apache-2.0
📊 SOTA-class open reranker (2025); strong BEIR; Qwen2.5-1.5B backbone
~3-4GB VRAM
sentence-transformers
ollama run hf.co/mixedbread-ai/mxbai-rerank-large-v2-ggufdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model mixedbread-ai/mxbai-rerank-large-v2 --task score
Hugging Face ↗
· sentence-transformers
stella_en_1.5B_v5
3.1 GBNovaSearch (dunzhang) · 1.5B · MIT
📊 MTEB (English v1) ~71.2 avg — top-tier open English embedder; basis of jasper (MTEB #2)
~3-4GB VRAM
sentence-transformers
ollama run hf.co/dunzhang/stella_en_1.5B_v5docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run hf.co/dunzhang/stella_en_1.5B_v5
Hugging Face ↗
· sentence-transformers
Qwen3-VL-Embedding-2B
4 GBQwen · 2B · apache-2.0 · discovered
~5GB VRAM, or CPU with 6GB RAM
docker
pip install sentence-transformers # SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
Hugging Face ↗
· sentence-transformers
Qwen3-Embedding-8B
4.7 GBAlibaba Qwen · 8B · Apache-2.0
📊 MTEB Multilingual mean 70.58 — #1 on MTEB multilingual leaderboard (Jun 5 2025)
~6-9GB VRAM at Q4-Q8; runs on CPU slowly
ollama
ollama run dengcao/Qwen3-Embedding-8B:Q4_K_Mdocker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama && docker exec -it ollama ollama run dengcao/Qwen3-Embedding-8B:Q4_K_M
Hugging Face ↗
· ollama
Qwen3-Reranker-8B
4.7 GBAlibaba Qwen · 8B · Apache-2.0
📊 Best open reranker quality in Qwen3 series; top BEIR/MIRACL reranking
~6-9GB VRAM at Q4-Q8
vllm
ollama run hf.co/Mungert/Qwen3-Reranker-8B-GGUFdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model Qwen/Qwen3-Reranker-8B
Hugging Face ↗
· vllm
gte-Qwen2-7B-instruct
4.7 GBAlibaba-NLP · 7B · Apache-2.0
📊 MTEB ~70 avg — #1 English & Chinese MTEB at release (Jun 2024)
~6-8GB VRAM at Q4; 15GB+ at F16
vllm
ollama run hf.co/mradermacher/gte-Qwen2-7B-instruct-GGUF:Q4_K_Mdocker
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model Alibaba-NLP/gte-Qwen2-7B-instruct
Hugging Face ↗
· vllm
MoAI-Embedding-4B
8 GBBCCard · 4B · apache-2.0 · discovered
~10GB VRAM (RTX 3090/4090)
docker
pip install sentence-transformers # SentenceTransformer("BCCard/MoAI-Embedding-4B")
Hugging Face ↗
· sentence-transformers
Qwen3-VL-Embedding-8B
16 GBQwen · 8B · apache-2.0 · discovered
~19GB VRAM (24GB GPU)
docker
pip install sentence-transformers # SentenceTransformer("Qwen/Qwen3-VL-Embedding-8B")
Hugging Face ↗
· sentence-transformers