

OpenAI-compatible API. Same key as Fast Apply and Compact. Base URL: https://api.morphllm.com/v1.
| Model | ID | Speed | Context | In / Out per 1M | Modalities |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 397B | morph-qwen35-397b | ~120 tok/s | 262k | $0.55 / $3.50 | text + image |
| MiniMax M2.7 | morph-minimax27-230b | ~90 tok/s | 200k | $0.80 / $2.20 | text |
| Qwen 3.6 27B | morph-qwen36-27b | ~100 tok/s | 131k | $0.55 / $2.40 | text |
All models support tools, response_format (JSON mode and JSON schema), structured outputs, logprobs, and reasoning. How to pick: Qwen 397B is the default. MiniMax has the cheapest output tokens, so it wins on long generations and agent loops. Qwen 27B is dense, so its first-token latency is more predictable than the MoE models'. Use Model Router to pick automatically per request.

Prefix Caching

Automatic prefix caching is on for all models. No configuration, no separate pricing tier. In multi-turn conversations and agent loops where the system prompt and prior context repeat across requests, cached prefill skips the redundant computation. In production, multi-turn workloads see around an 80% cache hit rate, which cuts time-to-first-token roughly in half on long prompts. This matters most for agent inner loops (tool call → result → next step), where the same 10k+ token context prefixes every request.
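
Caching keys on an exact prefix match, so structure agent loops append-only: never rewrite or reorder earlier messages between turns. A minimal sketch of the pattern with the OpenAI Python SDK (the stop condition and test output are illustrative):

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.morphllm.com/v1",
)

# The system prompt plus accumulated history is the cached prefix.
messages = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Fix the failing tests in this repo."},
]

for _ in range(8):  # agent inner loop
    response = client.chat.completions.create(
        model="morph-qwen35-397b",
        messages=messages,
    )
    reply = response.choices[0].message.content or ""
    # Append only -- editing earlier messages would invalidate the cache.
    messages.append({"role": "assistant", "content": reply})
    if "DONE" in reply:
        break
    # The new observation goes on the end; everything before it is a cache hit.
    messages.append({"role": "user", "content": "Test output: ..."})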

Quick Start

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.morphllm.com/v1",
)

response = client.chat.completions.create(
    model="morph-qwen35-397b",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Refactor this Express handler to use async/await: ..."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
Code written against gpt-4o-mini or gpt-5 works unchanged: swap the model ID and base URL.

Tools and Structured Output

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://api.morphllm.com/v1",
});

const response = await client.chat.completions.create({
  model: "morph-minimax27-230b",
  messages: [{ role: "user", content: "What's the weather in SF?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  response_format: { type: "json_object" },
});
Reasoning is off by default. Enable it with reasoning: { effort: "medium" } ("low" and "high" are also accepted). Reasoning tokens bill as output.
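
The same round trip in Python, including returning the tool result (the weather payload is made up; with the Python SDK, vendor-specific params such as reasoning travel via extra_body):

import json

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.morphllm.com/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in SF?"}]
first = client.chat.completions.create(
    model="morph-minimax27-230b",
    messages=messages,
    tools=tools,
    extra_body={"reasoning": {"effort": "medium"}},  # vendor param, so extra_body
)

# Assumes the model chose to call the tool; a direct answer has no tool_calls.
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "SF"}

# Run the tool yourself, then reply under the same tool_call_id.
messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps({"temp_f": 61, "conditions": "fog"}),  # illustrative result
})
final = client.chat.completions.create(
    model="morph-minimax27-230b",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)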

Pricing

Per-token, no minimums. The table above is canonical. Live rates: /v1/models (snippet after the list below).
  • Images (Qwen 397B only) bill as text tokens at the input rate
  • 4xx requests are not billed; partial generations bill for tokens returned
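
To read live rates programmatically, list models with the same SDK. The standard fields (id, created, owned_by) are typed; any pricing metadata comes through on the raw payload, and the exact field names are not guaranteed here:

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.morphllm.com/v1")

for model in client.models.list():
    # model_dump() includes any extra fields beyond the typed id/created/owned_by,
    # which is where per-token rates would appear (exact keys not guaranteed).
    print(model.id, model.model_dump())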

Coming Soon

DeepSeek V4 Flash

morph-dsv4flash · MoE, 393k context · BETA

Private beta, limited capacity. Email us for access.

Pitfalls

  • TPS numbers are generation throughput, not end-to-end. With 30k tokens of context, prefill dominates first-token wait even with caching. For agent loops, keep a smaller working context with Compact rather than filling the full window.
  • These models use the OpenAI tool-call shape, not Anthropic tool_use blocks or Gemini functionDeclarations. Use the OpenAI SDK or @ai-sdk/openai pointed at our base URL.
  • JSON mode needs both the flag and the prompt: pass response_format: { type: "json_object" } and say "respond in JSON" in your prompt. For strict shape control, use response_format: { type: "json_schema", json_schema: { ... } } (see the sketch below).
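
For the strict path, a minimal Python sketch (the location schema is illustrative):

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.morphllm.com/v1")

response = client.chat.completions.create(
    model="morph-qwen35-397b",
    messages=[{
        "role": "user",
        "content": "Extract city and country from: 'She flew to Lyon, France.' Respond in JSON.",
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "strict": True,  # constrain output to exactly this shape
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # {"city": "Lyon", "country": "France"}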

See Also

  • Model Router — auto-route between these and frontier models per request
  • Compact — shrink context before paying for it
  • WarpGrep — code search for retrieval when context is the bottleneck