NVIDIA models · LLM Switchboard

build.nvidia.com hosts NIM microservices behind an OpenAI-compatible endpoint integrate.api.nvidia.com/v1. Generous free credits to start.

Model	Maker	Type	Context	$ In	$ Out	Speed t/s	Capability	Best for
DeepSeek V4 Pro deepseek-ai/deepseek-v4-pro	DeepSeek AI	chat	1M	$0.43	$0.87	45	96	Frontier MoE (1.6T total / ~49B active, hybrid compressed-sparse attention) with ~1M context and three reasoning modes (Non-think / Think High / Think Max).
GLM-5.1 z-ai/glm-5.1	Z.ai (Zhipu)	chat	203k	$0.98	$3.08	45	95	Flagship agentic-engineering LLM (754B total / ~40B active, 256 routed experts) for coding, agentic workflows and long-horizon reasoning; sustains optimization
DeepSeek-R1 deepseek-ai/deepseek-r1	DeepSeek AI	chat	164k	$0.70	$2.50	45	94	Established 671B MoE open reasoning model with full reasoning traces; strong math, code and logic.
Kimi K2 Instruct moonshotai/kimi-k2-instruct	Moonshot AI	chat	131k	$0.57	$2.30	45	94	Large open MoE (1T total / 32B active) tuned for agentic tool use, coding and general chat; strong instruction following.
Qwen3 Coder 480B-A35B Instruct qwen/qwen3-coder-480b-a35b-instruct	Qwen (Alibaba)	chat	262k	—	—	90	93	State-of-the-art open coding/agentic-coding MoE (480B total / 35B active); native 256K context extendable to ~1M via YaRN.
NVIDIA Nemotron 3 Ultra 550B-A55B nvidia/nemotron-3-ultra-550b-a55b	NVIDIA	chat	1M	$0.50	$2.20	70	92	Frontier-tier open MoE reasoning (550B total / 55B active, hybrid Mamba-2 + LatentMoE) for maximum accuracy on hard reasoning, math and agentic decision-making.
Llama 3.1 Nemotron Ultra 253B v1 nvidia/llama-3.1-nemotron-ultra-253b-v1	NVIDIA	chat	128k	—	—	45	92	Prior-gen high-accuracy open reasoning, agentic tool-calling, RAG and complex math/coding; derivative of Llama-3.1-405B compressed via Neural Architecture Searc
Qwen3 235B-A22B qwen/qwen3-235b-a22b	Qwen (Alibaba)	chat	131k	$0.46	$1.82	110	90	Flagship Qwen3 hybrid-reasoning MoE (235B total / 22B active) with toggleable thinking mode; strong multilingual chat, reasoning, math and tool use.
Llama 4 Maverick 17B-128E Instruct meta/llama-4-maverick-17b-128e-instruct	Meta	chat	1M	—	—	45	90	Natively multimodal MoE (400B total / 17B active, 128 experts, early fusion) with ~1M context; multilingual text+image input, chat, knowledge and code.
gpt-oss-120b openai/gpt-oss-120b	OpenAI	chat	131k	$0.04	$0.18	45	89	OpenAI open-weight 117B MoE for high-reasoning, agentic and general-purpose production use; configurable reasoning effort, tool calling and structured outputs.
NVIDIA Nemotron 3 Super 120B-A12B nvidia/nemotron-3-super-120b-a12b	NVIDIA	chat	1M	$0.09	$0.45	150	87	Newest-gen hybrid Mamba-2 + LatentMoE reasoning (120B total / 12B active, first Nemotron pre-trained in NVFP4) with up to ~1M context for deep document reasonin
Qwen3-Next 80B-A3B Instruct qwen/qwen3-next-80b-a3b-instruct	Qwen (Alibaba)	chat	262k	$0.09	$1.10	240	85	Efficient ultra-sparse MoE (80B total / 3B active) for fast, low-cost instruct chat and agentic tasks at long context; high throughput per active parameter.
Llama 3.3 Nemotron Super 49B v1.5 nvidia/llama-3.3-nemotron-super-49b-v1.5	NVIDIA	chat	131k	$0.40	$0.40	55	80	Balanced accuracy/compute reasoning that fits on a single H200 (derivative of Llama-3.3-70B via NAS); agentic workflows, RAG, tool calling.
Llama 3.3 70B Instruct meta/llama-3.3-70b-instruct	Meta	chat	131k	$0.10	$0.32	55	78	Widely-used dense 70B instruct model for general chat, tool calling and RAG; reliable baseline with broad ecosystem support.
gpt-oss-20b openai/gpt-oss-20b	OpenAI	chat	131k	$0.03	$0.14	110	76	Smaller OpenAI open-weight MoE for cost-efficient reasoning and agentic tasks; good for latency-sensitive or local-friendly deployments.
NVIDIA Nemotron 3 Nano 30B-A3B nvidia/nemotron-3-nano-30b-a3b	NVIDIA	chat	262k	$0.05	$0.20	240	74	Efficient open hybrid Mamba-2 + MoE reasoning (30B total / 3B active) with up to ~1M context and configurable thinking budget; ~4x throughput of Nemotron 2 Nano
NVIDIA Nemotron 3 Nano Omni 30B-A3B (Reasoning) nvidia/nemotron-3-nano-omni-30b-a3b-reasoning	NVIDIA	chat	256k	free	free	240	73	Multimodal perception sub-agent for agentic AI: native text, image, video and audio input with reasoning.
DeepSeek V4 Flash deepseek-ai/deepseek-v4-flash	DeepSeek AI	chat	1M	$0.09	$0.18	80	66	Fast, cost-efficient MoE with ~1M context optimized for high-throughput coding and agentic workflows; the latency-oriented sibling of V4 Pro.
DeepSeek V3.1 deepseek-ai/deepseek-v3.1	DeepSeek AI	chat	164k	$0.21	$0.79	80	66	Hybrid model that toggles reasoning on/off in one deployment while keeping V3-family fast single-pass generation; general chat, coding and agentic use.
DeepSeek V3.2 deepseek-ai/deepseek-v3.2	DeepSeek AI	chat	131k	$0.23	$0.34	80	66	Incremental V3.x update with improved efficiency and reasoning; general-purpose chat, coding and tool-calling.
Llama 3.1 Nemotron Nano 8B v1 nvidia/llama-3.1-nemotron-nano-8b-v1	NVIDIA	chat	128k	—	—	150	59	Cost-efficient on-device/edge reasoning and agentic tasks; smallest prior-gen Nemotron reasoning tier for latency-sensitive deployments.
Llama 3.1 8B Instruct meta/llama-3.1-8b-instruct	Meta	chat	131k	$0.02	$0.03	150	57	Small, fast dense model for cheap high-volume chat, classification and simple tool use; common fallback/draft model.
Llama Embed Nemotron 8B nvidia/llama-embed-nemotron-8b	NVIDIA	embedding	33k	—	—	—	25	Flagship multilingual/cross-lingual text embedding model (Llama-3.1-8B with bidirectional attention, 4096-dim output); instruction-aware, top of the multilingua
NeMo Retriever Llama 3.2 EmbedQA 1B v2 nvidia/llama-3.2-nv-embedqa-1b-v2	NVIDIA	embedding	8k	—	—	—	25	Production RAG embedding NIM optimized for multilingual/cross-lingual QA retrieval; Matryoshka (dynamic) embedding size, up to 8192-token documents.
Llama Nemotron Rerank 1B v2 nvidia/llama-nemotron-rerank-1b-v2	NVIDIA	rerank	8k	—	—	—	25	Cross-encoder reranker NIM that reorders retrieved passages by relevance; pairs with the NeMo Retriever / Llama embedding models to boost RAG accuracy (BEIR+Tec
NeMo Retriever Llama 3.2 RerankQA 1B v2 nvidia/llama-3.2-nv-rerankqa-1b-v2	NVIDIA	rerank	8k	—	—	—	25	Established multilingual/cross-lingual reranking NIM; the embedqa-1b-v2 + rerankqa-1b-v2 pipeline is NVIDIA's reference RAG retrieval stack.
Llama Nemotron Rerank VL 1B v2 nvidia/llama-nemotron-rerank-vl-1b-v2	NVIDIA	rerank	8k	—	—	—	25	Vision-language reranker for multimodal/document RAG; reorders text-and-image candidates by relevance for document-extraction and visual retrieval pipelines.