NVIDIA models

build.nvidia.com NIM catalog — OpenAI-compatible.

catalog updated 47m ago
build.nvidia.com hosts NIM microservices behind an OpenAI-compatible endpoint integrate.api.nvidia.com/v1. Generous free credits to start.
27 / 27
ModelMakerTypeContext$ In$ OutSpeed t/sCapabilityBest for
DeepSeek V4 Pro
deepseek-ai/deepseek-v4-pro
DeepSeek AIchat1M$0.43$0.874596Frontier MoE (1.6T total / ~49B active, hybrid compressed-sparse attention) with ~1M context and three reasoning modes (Non-think / Think High / Think Max).
GLM-5.1
z-ai/glm-5.1
Z.ai (Zhipu)chat203k$0.98$3.084595Flagship agentic-engineering LLM (754B total / ~40B active, 256 routed experts) for coding, agentic workflows and long-horizon reasoning; sustains optimization
DeepSeek-R1
deepseek-ai/deepseek-r1
DeepSeek AIchat164k$0.70$2.504594Established 671B MoE open reasoning model with full reasoning traces; strong math, code and logic.
Kimi K2 Instruct
moonshotai/kimi-k2-instruct
Moonshot AIchat131k$0.57$2.304594Large open MoE (1T total / 32B active) tuned for agentic tool use, coding and general chat; strong instruction following.
Qwen3 Coder 480B-A35B Instruct
qwen/qwen3-coder-480b-a35b-instruct
Qwen (Alibaba)chat262k9093State-of-the-art open coding/agentic-coding MoE (480B total / 35B active); native 256K context extendable to ~1M via YaRN.
NVIDIA Nemotron 3 Ultra 550B-A55B
nvidia/nemotron-3-ultra-550b-a55b
NVIDIAchat1M$0.50$2.207092Frontier-tier open MoE reasoning (550B total / 55B active, hybrid Mamba-2 + LatentMoE) for maximum accuracy on hard reasoning, math and agentic decision-making.
Llama 3.1 Nemotron Ultra 253B v1
nvidia/llama-3.1-nemotron-ultra-253b-v1
NVIDIAchat128k4592Prior-gen high-accuracy open reasoning, agentic tool-calling, RAG and complex math/coding; derivative of Llama-3.1-405B compressed via Neural Architecture Searc
Qwen3 235B-A22B
qwen/qwen3-235b-a22b
Qwen (Alibaba)chat131k$0.46$1.8211090Flagship Qwen3 hybrid-reasoning MoE (235B total / 22B active) with toggleable thinking mode; strong multilingual chat, reasoning, math and tool use.
Llama 4 Maverick 17B-128E Instruct
meta/llama-4-maverick-17b-128e-instruct
Metachat1M4590Natively multimodal MoE (400B total / 17B active, 128 experts, early fusion) with ~1M context; multilingual text+image input, chat, knowledge and code.
gpt-oss-120b
openai/gpt-oss-120b
OpenAIchat131k$0.04$0.184589OpenAI open-weight 117B MoE for high-reasoning, agentic and general-purpose production use; configurable reasoning effort, tool calling and structured outputs.
NVIDIA Nemotron 3 Super 120B-A12B
nvidia/nemotron-3-super-120b-a12b
NVIDIAchat1M$0.09$0.4515087Newest-gen hybrid Mamba-2 + LatentMoE reasoning (120B total / 12B active, first Nemotron pre-trained in NVFP4) with up to ~1M context for deep document reasonin
Qwen3-Next 80B-A3B Instruct
qwen/qwen3-next-80b-a3b-instruct
Qwen (Alibaba)chat262k$0.09$1.1024085Efficient ultra-sparse MoE (80B total / 3B active) for fast, low-cost instruct chat and agentic tasks at long context; high throughput per active parameter.
Llama 3.3 Nemotron Super 49B v1.5
nvidia/llama-3.3-nemotron-super-49b-v1.5
NVIDIAchat131k$0.40$0.405580Balanced accuracy/compute reasoning that fits on a single H200 (derivative of Llama-3.3-70B via NAS); agentic workflows, RAG, tool calling.
Llama 3.3 70B Instruct
meta/llama-3.3-70b-instruct
Metachat131k$0.10$0.325578Widely-used dense 70B instruct model for general chat, tool calling and RAG; reliable baseline with broad ecosystem support.
gpt-oss-20b
openai/gpt-oss-20b
OpenAIchat131k$0.03$0.1411076Smaller OpenAI open-weight MoE for cost-efficient reasoning and agentic tasks; good for latency-sensitive or local-friendly deployments.
NVIDIA Nemotron 3 Nano 30B-A3B
nvidia/nemotron-3-nano-30b-a3b
NVIDIAchat262k$0.05$0.2024074Efficient open hybrid Mamba-2 + MoE reasoning (30B total / 3B active) with up to ~1M context and configurable thinking budget; ~4x throughput of Nemotron 2 Nano
NVIDIA Nemotron 3 Nano Omni 30B-A3B (Reasoning)
nvidia/nemotron-3-nano-omni-30b-a3b-reasoning
NVIDIAchat256kfreefree24073Multimodal perception sub-agent for agentic AI: native text, image, video and audio input with reasoning.
DeepSeek V4 Flash
deepseek-ai/deepseek-v4-flash
DeepSeek AIchat1M$0.09$0.188066Fast, cost-efficient MoE with ~1M context optimized for high-throughput coding and agentic workflows; the latency-oriented sibling of V4 Pro.
DeepSeek V3.1
deepseek-ai/deepseek-v3.1
DeepSeek AIchat164k$0.21$0.798066Hybrid model that toggles reasoning on/off in one deployment while keeping V3-family fast single-pass generation; general chat, coding and agentic use.
DeepSeek V3.2
deepseek-ai/deepseek-v3.2
DeepSeek AIchat131k$0.23$0.348066Incremental V3.x update with improved efficiency and reasoning; general-purpose chat, coding and tool-calling.
Llama 3.1 Nemotron Nano 8B v1
nvidia/llama-3.1-nemotron-nano-8b-v1
NVIDIAchat128k15059Cost-efficient on-device/edge reasoning and agentic tasks; smallest prior-gen Nemotron reasoning tier for latency-sensitive deployments.
Llama 3.1 8B Instruct
meta/llama-3.1-8b-instruct
Metachat131k$0.02$0.0315057Small, fast dense model for cheap high-volume chat, classification and simple tool use; common fallback/draft model.
Llama Embed Nemotron 8B
nvidia/llama-embed-nemotron-8b
NVIDIAembedding33k25Flagship multilingual/cross-lingual text embedding model (Llama-3.1-8B with bidirectional attention, 4096-dim output); instruction-aware, top of the multilingua
NeMo Retriever Llama 3.2 EmbedQA 1B v2
nvidia/llama-3.2-nv-embedqa-1b-v2
NVIDIAembedding8k25Production RAG embedding NIM optimized for multilingual/cross-lingual QA retrieval; Matryoshka (dynamic) embedding size, up to 8192-token documents.
Llama Nemotron Rerank 1B v2
nvidia/llama-nemotron-rerank-1b-v2
NVIDIArerank8k25Cross-encoder reranker NIM that reorders retrieved passages by relevance; pairs with the NeMo Retriever / Llama embedding models to boost RAG accuracy (BEIR+Tec
NeMo Retriever Llama 3.2 RerankQA 1B v2
nvidia/llama-3.2-nv-rerankqa-1b-v2
NVIDIArerank8k25Established multilingual/cross-lingual reranking NIM; the embedqa-1b-v2 + rerankqa-1b-v2 pipeline is NVIDIA's reference RAG retrieval stack.
Llama Nemotron Rerank VL 1B v2
nvidia/llama-nemotron-rerank-vl-1b-v2
NVIDIArerank8k25Vision-language reranker for multimodal/document RAG; reorders text-and-image candidates by relevance for document-extraction and visual retrieval pipelines.
Sign in to continue

LLM Switchboard is private — sign in with Authly to access the control room.

Sign in with Authly
← Back to home