build.nvidia.com hosts NIM microservices behind an OpenAI-compatible endpoint
integrate.api.nvidia.com/v1. Generous free credits to start.| Model | Maker | Type | Context | $ In | $ Out | Speed t/s | Capability | Best for |
|---|---|---|---|---|---|---|---|---|
DeepSeek V4 Pro deepseek-ai/deepseek-v4-pro | DeepSeek AI | chat | 1M | $0.43 | $0.87 | 45 | Frontier MoE (1.6T total / ~49B active, hybrid compressed-sparse attention) with ~1M context and three reasoning modes (Non-think / Think High / Think Max). | |
GLM-5.1 z-ai/glm-5.1 | Z.ai (Zhipu) | chat | 203k | $0.98 | $3.08 | 45 | Flagship agentic-engineering LLM (754B total / ~40B active, 256 routed experts) for coding, agentic workflows and long-horizon reasoning; sustains optimization | |
DeepSeek-R1 deepseek-ai/deepseek-r1 | DeepSeek AI | chat | 164k | $0.70 | $2.50 | 45 | Established 671B MoE open reasoning model with full reasoning traces; strong math, code and logic. | |
Kimi K2 Instruct moonshotai/kimi-k2-instruct | Moonshot AI | chat | 131k | $0.57 | $2.30 | 45 | Large open MoE (1T total / 32B active) tuned for agentic tool use, coding and general chat; strong instruction following. | |
Qwen3 Coder 480B-A35B Instruct qwen/qwen3-coder-480b-a35b-instruct | Qwen (Alibaba) | chat | 262k | — | — | 90 | State-of-the-art open coding/agentic-coding MoE (480B total / 35B active); native 256K context extendable to ~1M via YaRN. | |
NVIDIA Nemotron 3 Ultra 550B-A55B nvidia/nemotron-3-ultra-550b-a55b | NVIDIA | chat | 1M | $0.50 | $2.20 | 70 | Frontier-tier open MoE reasoning (550B total / 55B active, hybrid Mamba-2 + LatentMoE) for maximum accuracy on hard reasoning, math and agentic decision-making. | |
Llama 3.1 Nemotron Ultra 253B v1 nvidia/llama-3.1-nemotron-ultra-253b-v1 | NVIDIA | chat | 128k | — | — | 45 | Prior-gen high-accuracy open reasoning, agentic tool-calling, RAG and complex math/coding; derivative of Llama-3.1-405B compressed via Neural Architecture Searc | |
Qwen3 235B-A22B qwen/qwen3-235b-a22b | Qwen (Alibaba) | chat | 131k | $0.46 | $1.82 | 110 | Flagship Qwen3 hybrid-reasoning MoE (235B total / 22B active) with toggleable thinking mode; strong multilingual chat, reasoning, math and tool use. | |
Llama 4 Maverick 17B-128E Instruct meta/llama-4-maverick-17b-128e-instruct | Meta | chat | 1M | — | — | 45 | Natively multimodal MoE (400B total / 17B active, 128 experts, early fusion) with ~1M context; multilingual text+image input, chat, knowledge and code. | |
gpt-oss-120b openai/gpt-oss-120b | OpenAI | chat | 131k | $0.04 | $0.18 | 45 | OpenAI open-weight 117B MoE for high-reasoning, agentic and general-purpose production use; configurable reasoning effort, tool calling and structured outputs. | |
NVIDIA Nemotron 3 Super 120B-A12B nvidia/nemotron-3-super-120b-a12b | NVIDIA | chat | 1M | $0.09 | $0.45 | 150 | Newest-gen hybrid Mamba-2 + LatentMoE reasoning (120B total / 12B active, first Nemotron pre-trained in NVFP4) with up to ~1M context for deep document reasonin | |
Qwen3-Next 80B-A3B Instruct qwen/qwen3-next-80b-a3b-instruct | Qwen (Alibaba) | chat | 262k | $0.09 | $1.10 | 240 | Efficient ultra-sparse MoE (80B total / 3B active) for fast, low-cost instruct chat and agentic tasks at long context; high throughput per active parameter. | |
Llama 3.3 Nemotron Super 49B v1.5 nvidia/llama-3.3-nemotron-super-49b-v1.5 | NVIDIA | chat | 131k | $0.40 | $0.40 | 55 | Balanced accuracy/compute reasoning that fits on a single H200 (derivative of Llama-3.3-70B via NAS); agentic workflows, RAG, tool calling. | |
Llama 3.3 70B Instruct meta/llama-3.3-70b-instruct | Meta | chat | 131k | $0.10 | $0.32 | 55 | Widely-used dense 70B instruct model for general chat, tool calling and RAG; reliable baseline with broad ecosystem support. | |
gpt-oss-20b openai/gpt-oss-20b | OpenAI | chat | 131k | $0.03 | $0.14 | 110 | Smaller OpenAI open-weight MoE for cost-efficient reasoning and agentic tasks; good for latency-sensitive or local-friendly deployments. | |
NVIDIA Nemotron 3 Nano 30B-A3B nvidia/nemotron-3-nano-30b-a3b | NVIDIA | chat | 262k | $0.05 | $0.20 | 240 | Efficient open hybrid Mamba-2 + MoE reasoning (30B total / 3B active) with up to ~1M context and configurable thinking budget; ~4x throughput of Nemotron 2 Nano | |
NVIDIA Nemotron 3 Nano Omni 30B-A3B (Reasoning) nvidia/nemotron-3-nano-omni-30b-a3b-reasoning | NVIDIA | chat | 256k | free | free | 240 | Multimodal perception sub-agent for agentic AI: native text, image, video and audio input with reasoning. | |
DeepSeek V4 Flash deepseek-ai/deepseek-v4-flash | DeepSeek AI | chat | 1M | $0.09 | $0.18 | 80 | Fast, cost-efficient MoE with ~1M context optimized for high-throughput coding and agentic workflows; the latency-oriented sibling of V4 Pro. | |
DeepSeek V3.1 deepseek-ai/deepseek-v3.1 | DeepSeek AI | chat | 164k | $0.21 | $0.79 | 80 | Hybrid model that toggles reasoning on/off in one deployment while keeping V3-family fast single-pass generation; general chat, coding and agentic use. | |
DeepSeek V3.2 deepseek-ai/deepseek-v3.2 | DeepSeek AI | chat | 131k | $0.23 | $0.34 | 80 | Incremental V3.x update with improved efficiency and reasoning; general-purpose chat, coding and tool-calling. | |
Llama 3.1 Nemotron Nano 8B v1 nvidia/llama-3.1-nemotron-nano-8b-v1 | NVIDIA | chat | 128k | — | — | 150 | Cost-efficient on-device/edge reasoning and agentic tasks; smallest prior-gen Nemotron reasoning tier for latency-sensitive deployments. | |
Llama 3.1 8B Instruct meta/llama-3.1-8b-instruct | Meta | chat | 131k | $0.02 | $0.03 | 150 | Small, fast dense model for cheap high-volume chat, classification and simple tool use; common fallback/draft model. | |
Llama Embed Nemotron 8B nvidia/llama-embed-nemotron-8b | NVIDIA | embedding | 33k | — | — | — | Flagship multilingual/cross-lingual text embedding model (Llama-3.1-8B with bidirectional attention, 4096-dim output); instruction-aware, top of the multilingua | |
NeMo Retriever Llama 3.2 EmbedQA 1B v2 nvidia/llama-3.2-nv-embedqa-1b-v2 | NVIDIA | embedding | 8k | — | — | — | Production RAG embedding NIM optimized for multilingual/cross-lingual QA retrieval; Matryoshka (dynamic) embedding size, up to 8192-token documents. | |
Llama Nemotron Rerank 1B v2 nvidia/llama-nemotron-rerank-1b-v2 | NVIDIA | rerank | 8k | — | — | — | Cross-encoder reranker NIM that reorders retrieved passages by relevance; pairs with the NeMo Retriever / Llama embedding models to boost RAG accuracy (BEIR+Tec | |
NeMo Retriever Llama 3.2 RerankQA 1B v2 nvidia/llama-3.2-nv-rerankqa-1b-v2 | NVIDIA | rerank | 8k | — | — | — | Established multilingual/cross-lingual reranking NIM; the embedqa-1b-v2 + rerankqa-1b-v2 pipeline is NVIDIA's reference RAG retrieval stack. | |
Llama Nemotron Rerank VL 1B v2 nvidia/llama-nemotron-rerank-vl-1b-v2 | NVIDIA | rerank | 8k | — | — | — | Vision-language reranker for multimodal/document RAG; reorders text-and-image candidates by relevance for document-extraction and visual retrieval pipelines. |