

OpenAI-compatible API. Same key as Fast Apply and Compact. Base URL: https://api.morphllm.com/v1.
| Model | ID | Speed | Context | In / Out per 1M | Modalities |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 397B | morph-qwen35-397b | ~120 tok/s | 262k | $0.55 / $3.50 | text + image |
| MiniMax M2.7 | morph-minimax27-230b | ~90 tok/s | 200k | $0.80 / $2.20 | text |
| Qwen 3.6 27B | morph-qwen36-27b | ~100 tok/s | 131k | $0.55 / $2.40 | text |
All models support tools, response_format (JSON mode and JSON schema), structured outputs, logprobs, and reasoning. How to pick: Qwen 397B is the default. MiniMax has the cheapest output tokens, so it wins on long generations and agent loops. Qwen 27B is dense, so its first-token latency is more predictable than the MoE models'. Use Model Router to pick automatically per request.

Prefix Caching

Automatic prefix caching is on for all models. No configuration, no separate pricing tier. In multi-turn conversations and agent loops where the system prompt and prior context repeat across requests, cached prefill skips the redundant computation. In production, multi-turn workloads see around an 80% cache hit rate, which cuts time-to-first-token roughly in half on long prompts. This matters most for agent inner loops (tool call → result → next step), where the same 10k+ token context prefixes every request.
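
Caching keys on an exact prefix match, so structure agent loops append-only: never rewrite or reorder earlier messages between turns. A minimal sketch of the pattern with the OpenAI Python SDK (the stop condition and test output are illustrative):

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.morphllm.com/v1",
)

# The system prompt plus accumulated history is the cached prefix.
messages = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Fix the failing tests in this repo."},
]

for _ in range(8):  # agent inner loop
    response = client.chat.completions.create(
        model="morph-qwen35-397b",
        messages=messages,
    )
    reply = response.choices[0].message.content or ""
    # Append only -- editing earlier messages would invalidate the cache.
    messages.append({"role": "assistant", "content": reply})
    if "DONE" in reply:
        break
    # The new observation goes on the end; everything before it is a cache hit.
    messages.append({"role": "user", "content": "Test output: ..."})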

Quick Start

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.morphllm.com/v1",
)

response = client.chat.completions.create(
    model="morph-qwen35-397b",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Refactor this Express handler to use async/await: ..."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
Code written against gpt-4o-mini or gpt-5 works unchanged: swap the model ID and base URL.

Tools and Structured Output

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://api.morphllm.com/v1",
});

const response = await client.chat.completions.create({
  model: "morph-minimax27-230b",
  messages: [{ role: "user", content: "What's the weather in SF?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  response_format: { type: "json_object" },
});
Reasoning is off by default. Enable it with reasoning: { effort: "medium" } ("low" and "high" are also accepted). Reasoning tokens bill as output.
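
The same round trip in Python, including returning the tool result (the weather payload is made up; with the Python SDK, vendor-specific params such as reasoning travel via extra_body):

import json

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.morphllm.com/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in SF?"}]
first = client.chat.completions.create(
    model="morph-minimax27-230b",
    messages=messages,
    tools=tools,
    extra_body={"reasoning": {"effort": "medium"}},  # vendor param, so extra_body
)

# Assumes the model chose to call the tool; a direct answer has no tool_calls.
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "SF"}

# Run the tool yourself, then reply under the same tool_call_id.
messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps({"temp_f": 61, "conditions": "fog"}),  # illustrative result
})
final = client.chat.completions.create(
    model="morph-minimax27-230b",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)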

Pricing

Per-token, no minimums. The table above is canonical. Live rates: /v1/models (snippet after the list below).
  • Images (Qwen 397B only) bill as text tokens at the input rate
  • 4xx requests are not billed; partial generations bill for tokens returned
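
To read live rates programmatically, list models with the same SDK. The standard fields (id, created, owned_by) are typed; any pricing metadata comes through on the raw payload, and the exact field names are not guaranteed here:

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.morphllm.com/v1")

for model in client.models.list():
    # model_dump() includes any extra fields beyond the typed id/created/owned_by,
    # which is where per-token rates would appear (exact keys not guaranteed).
    print(model.id, model.model_dump())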

Coming Soon

DeepSeek V4 Flash

morph-dsv4flash · MoE, 393k context · BETA

Private beta, limited capacity. Email us for access.

Pitfalls

  • TPS numbers are generation throughput, not end-to-end. With 30k tokens of context, prefill dominates first-token wait even with caching. For agent loops, keep a smaller working context with Compact rather than filling the full window.
  • These models use the OpenAI tool-call shape, not Anthropic tool_use blocks or Gemini functionDeclarations. Use the OpenAI SDK or @ai-sdk/openai pointed at our base URL.
  • JSON mode needs both the flag and the prompt: pass response_format: { type: "json_object" } and say "respond in JSON" in your prompt. For strict shape control, use response_format: { type: "json_schema", json_schema: { ... } } (see the sketch below).
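
For the strict path, a minimal Python sketch (the location schema is illustrative):

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.morphllm.com/v1")

response = client.chat.completions.create(
    model="morph-qwen35-397b",
    messages=[{
        "role": "user",
        "content": "Extract city and country from: 'She flew to Lyon, France.' Respond in JSON.",
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "strict": True,  # constrain output to exactly this shape
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # {"city": "Lyon", "country": "France"}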

See Also

  • Model Router — auto-route between these and frontier models per request
  • Compact — shrink context before paying for it
  • WarpGrep — code search for retrieval when context is the bottleneck