OpenAI-compatible API. Same key as Fast Apply and Compact. Base URL: https://api.morphllm.com/v1.
| Model | ID | Speed | Context | In / Out per 1M | Modalities |
|---|---|---|---|---|---|
| Qwen 3.5 397B | morph-qwen35-397b | ~120 tok/s | 262k | 3.50 | text + image |
| MiniMax M2.7 | morph-minimax27-230b | ~90 tok/s | 200k | 2.20 | text |
| Qwen 3.6 27B | morph-qwen36-27b | ~100 tok/s | 131k | 2.40 | text |
All three models support tools, response_format (JSON mode + JSON schema), structured outputs, logprobs, and reasoning.
How to pick: Qwen 397B is the default. MiniMax has the cheapest output tokens, so it wins on long generations and agent loops. Qwen 27B is dense, so first-token latency is more predictable than with the MoE models. Use Model Router to pick automatically per request.
Prefix Caching
Automatic prefix caching is on for all models. No configuration, no separate pricing tier. In multi-turn conversations and agent loops where the system prompt and prior context repeat across requests, cached prefill skips redundant computation. In production, multi-turn workloads see around an 80% cache hit rate, which cuts time-to-first-token roughly in half on long prompts. This matters most for agent inner loops (tool call > result > next step), where the same 10k+ token context prefixes every request.
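A minimal sketch of the pattern, assuming the OpenAI Node SDK and a MORPH_API_KEY env var: keep the system prompt and history append-only so every request shares the same byte-identical prefix.

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// Keep one append-only history. The system prompt and all prior turns stay
// byte-identical across requests, so each new call reuses the cached prefill.
const SYSTEM_PROMPT = "You are a coding agent. ..."; // long in practice; keep it stable across turns
const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
  { role: "system", content: SYSTEM_PROMPT },
];

async function step(userTurn: string): Promise<string | null> {
  messages.push({ role: "user", content: userTurn });
  const res = await client.chat.completions.create({
    model: "morph-qwen35-397b",
    messages,
  });
  const reply = res.choices[0].message;
  messages.push(reply); // append; never rewrite or reorder earlier turns
  return reply.content;
}
```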
Quick Start
The quick start is the same in Python, TypeScript, the Vercel AI SDK, and cURL: point an OpenAI-compatible client at the base URL.
Any code that already targets gpt-4o-mini or gpt-5 works as-is: swap the model ID and base URL.
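A minimal TypeScript sketch, assuming the OpenAI Node SDK and a MORPH_API_KEY env var; only the base URL and model ID differ from a stock OpenAI call.

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY, // same key as Fast Apply and Compact
  baseURL: "https://api.morphllm.com/v1",
});

const res = await client.chat.completions.create({
  model: "morph-qwen35-397b", // or morph-minimax27-230b / morph-qwen36-27b
  messages: [{ role: "user", content: "Explain prefix caching in one sentence." }],
});

console.log(res.choices[0].message.content);
```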
Tools and Structured Output
Set reasoning effort with reasoning: { effort: "medium" } ("low" / "high" also accepted). Reasoning tokens bill as output.
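A sketch of a request that combines a tool with reasoning effort. The get_weather tool is hypothetical, and since reasoning is an extension the stock OpenAI SDK types don't know, the params are cast.

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

const params = {
  model: "morph-minimax27-230b",
  reasoning: { effort: "medium" }, // "low" / "high"; reasoning tokens bill as output
  messages: [{ role: "user", content: "What's the weather in Berlin right now?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather", // hypothetical tool your app would implement
        description: "Look up current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
};

// Cast because `reasoning` is not in the stock OpenAI SDK param types.
const res = await client.chat.completions.create(params as any);

// Tool calls come back in the OpenAI shape on the assistant message.
console.log(res.choices[0].message.tool_calls);
```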
Pricing
Per-token, no minimums. The table above is canonical. Live rates: /v1/models (see the sketch after the notes below).
- Images (Qwen 397B only) bill as text tokens at the input rate
- 4xx requests are not billed; partial generations bill for tokens returned
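To read the live rates programmatically, a sketch that lists models from the standard /v1/models endpoint and prints each entry raw; the exact names of any rate fields are not documented here, so nothing is assumed about them.

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// Standard OpenAI-compatible model listing; dump the full objects so any
// Morph-specific rate fields are visible without guessing their names.
for await (const model of client.models.list()) {
  console.log(JSON.stringify(model, null, 2));
}
```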
Coming Soon
DeepSeek V4 Flash
morph-dsv4flash · MoE, 393k context · BETA
Private beta, limited capacity. Email us for access.
Pitfalls
Latency worse than expected
TPS numbers are generation throughput, not end-to-end. With 30k tokens of context, prefill dominates first-token wait even with caching. For agent loops, keep a smaller working context with Compact rather than filling the full window.
Tool calls not working
These models use the OpenAI tool-call shape, not Anthropic tool_use blocks or Gemini functionDeclarations. Use the OpenAI SDK or @ai-sdk/openai pointed at our base URL.
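A sketch of the Vercel AI SDK setup that pitfall implies, assuming a MORPH_API_KEY env var: create an OpenAI-compatible provider pointed at the Morph base URL.

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

// OpenAI-compatible provider pointed at Morph, so tool calls use the OpenAI shape.
const morph = createOpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

const { text } = await generateText({
  model: morph("morph-qwen35-397b"),
  prompt: "Say hello in five words.",
});

console.log(text);
```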
JSON mode returns prose
Pass response_format: { type: "json_object" } and say “respond in JSON” in your prompt. For strict shape control: response_format: { type: "json_schema", json_schema: { ... } } (see the sketch below).
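A sketch of strict shape control with a made-up triage schema; the schema name and fields are illustrative, while the response_format shape is the standard OpenAI one.

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

const res = await client.chat.completions.create({
  model: "morph-qwen35-397b",
  messages: [
    { role: "system", content: "Respond in JSON." },
    { role: "user", content: "Triage: 'Login page throws a 500 after the last deploy.'" },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "triage", // hypothetical schema for illustration
      strict: true,
      schema: {
        type: "object",
        properties: {
          severity: { type: "string", enum: ["low", "medium", "high"] },
          component: { type: "string" },
        },
        required: ["severity", "component"],
        additionalProperties: false,
      },
    },
  },
});

console.log(JSON.parse(res.choices[0].message.content ?? "{}"));
```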
See Also
- Model Router — auto-route between these and frontier models per request
- Compact — shrink context before paying for it
- WarpGrep — code search for retrieval when context is the bottleneck