Deploy API

Promote a fine-tuned adapter to a live endpoint in one click.

Every fine-tuned adapter on Soramai can be promoted to an autoscaling inference endpoint. Call it from the dashboard playground, the CLI, or directly over HTTPS. Endpoints are billed per request and cost nothing while idle.


What you get

Soramai endpoints are managed: scaling, queueing, retries, observability, and key management are all built in.

Autoscaling endpoints

Each deployed adapter gets its own HTTPS endpoint that scales from zero to as many workers as you need. Idle endpoints cost nothing.

Streaming responses

Text completions stream tokens over HTTP server-sent events. The schema mirrors common OpenAI-compatible clients.

Per-request billing

You pay for the GPU time each request actually consumes — no per-minute reservations and no warm-pool minimums.

Scoped API keys

Create, rotate, and revoke keys from the dashboard. Keys are scoped to a single account and shown once at creation time.

Region pinning

Pin a deployment to a region for data-residency requirements. Multi-region active-active is available on enterprise plans.

Logs and metrics

Per-request latency, queue depth, and error rate are exposed in the dashboard. Logs are retained for 90 days.

Text completions — request

Send a prompt to a fine-tuned adapter. The deployment is identified by the API key, so the URL is the same for every deployment under your account.

curl https://www.soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize this changelog entry.",
    "system_prompt": "You are a concise technical writer.",
    "temperature": 0.7,
    "max_new_tokens": 256
  }'

Streaming responses

Add stream: true to get tokens back as Server-Sent Events. Compatible with the Vercel AI SDK, openai-node, and any standard SSE client.

curl https://www.soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "prompt": "Write a haiku about graphics cards.",
    "stream": true
  }'

# Stream output (each block is one SSE event):
data: {"type":"chunk","text":"Fans spin at midnight—"}

data: {"type":"chunk","text":" silicon dreams of compute,"}

data: {"type":"chunk","text":" oceans of warm watts."}

data: {"type":"done","input_tokens":12,"output_tokens":22,"usage":{"coins_used":18,"coins_per_minute":200,"latency_ms":1843},"model":{"base_model":"Qwen/Qwen2.5-7B-Instruct","name":"haiku-bot"}}

Frame types: chunk (incremental text the client appends), done (final usage + model info, stream ends), and error (fatal error mid-stream, stream ends with no done). Authentication, rate limits, and billing work the same as for non-streaming requests; streaming is purely a transport choice.
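The three frame types can be handled with a small, transport-independent parser. The sketch below is a minimal illustration, not an official client: the names iter_events, collect, and StreamError are hypothetical, and the error frame is assumed to carry its message in a text field, which is not specified above.

```python
import json
from typing import Iterable, Iterator, Tuple


class StreamError(RuntimeError):
    """Raised on an error frame, or when the stream ends without a done frame."""


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per `data:` line; blank separator lines are skipped."""
    for line in lines:
        if not line.startswith("data:"):
            continue
        yield json.loads(line[len("data:"):].strip())


def collect(lines: Iterable[str]) -> Tuple[str, dict]:
    """Append chunk text until a done frame arrives; return (full_text, done_frame)."""
    parts = []
    for evt in iter_events(lines):
        if evt["type"] == "chunk":
            parts.append(evt["text"])
        elif evt["type"] == "done":
            return "".join(parts), evt
        elif evt["type"] == "error":
            # Assumption: the message lives in "text"; fall back to a generic message otherwise.
            raise StreamError(evt.get("text", "stream reported an error"))
    # An error frame ends the stream with no done, so a missing done is treated as a failure too.
    raise StreamError("stream ended without a done frame")


if __name__ == "__main__":
    sample = [
        'data: {"type":"chunk","text":"Fans spin"}',
        "",
        'data: {"type":"chunk","text":" at midnight"}',
        "",
        'data: {"type":"done","input_tokens":12,"output_tokens":22}',
    ]
    text, done = collect(sample)
    print(text)                   # Fans spin at midnight
    print(done["output_tokens"])  # 22
```

Because collect takes any iterable of decoded lines, the same function can be fed from a live HTTP response or from recorded frames in a unit test.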

JavaScript / TypeScript example

A complete streaming consumer in ~20 lines. Pairs cleanly with React.useState updates for a typewriter UI.

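const r = await fetch("https://www.soramai.com/api/v1/inference", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SORAMAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ prompt: "Hello", stream: true }),
});
// Fail early on HTTP errors instead of trying to parse an error body as SSE.
if (!r.ok || !r.body) throw new Error(`Request failed with HTTP ${r.status}`);

const reader = r.body.getReader();
const decoder = new TextDecoder();
let buf = "";    // holds a partial frame until its terminating blank line arrives
let output = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // SSE events are separated by a blank line; the last piece may be incomplete.
  const frames = buf.split("\n\n");
  buf = frames.pop() ?? "";
  for (const frame of frames) {
    const data = frame.replace(/^data:\s*/, "").trim();
    if (!data) continue;
    const evt = JSON.parse(data);
    if (evt.type === "chunk") output += evt.text;
    if (evt.type === "done") console.log("usage:", evt.usage);
    // An error frame ends the stream with no done frame (message field assumed to be `text`).
    if (evt.type === "error") throw new Error(evt.text ?? "stream error");
  }
}
console.log("final:", output);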

Python example

Streaming consumer using the standard requests library. Use httpx if you need asyncio.

import json
import os
import requests

with requests.post(
    "https://www.soramai.com/api/v1/inference",
    headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
    json={"prompt": "Hello", "stream": True},
    stream=True,
) as r:
    r.raise_for_status()  # surface 4xx/5xx before reading the stream
    # SSE is always UTF-8; without this, requests may fall back to ISO-8859-1 for text/* bodies.
    r.encoding = "utf-8"
    output = ""
    for raw in r.iter_lines(decode_unicode=True):
        # Skip the blank separator lines between events.
        if not raw or not raw.startswith("data:"):
            continue
        evt = json.loads(raw[len("data:"):].strip())
        if evt["type"] == "chunk":
            output += evt["text"]
            print(evt["text"], end="", flush=True)
        elif evt["type"] == "done":
            print(f"\n\n[usage] {evt['usage']}")
        elif evt["type"] == "error":
            raise RuntimeError(evt["text"])
# `output` now holds the full completion that was printed incrementally above.

Non-streaming response shape

When stream is omitted or false, the API returns a single JSON document with the full response. This is the lowest-effort integration if you don't need a typewriter-style UI.

{
  "response": "Soramai now supports per-second training billing...",
  "input_tokens": 124,
  "output_tokens": 87,
  "usage": {
    "coins_used": 31,
    "coins_per_minute": 200,
    "latency_ms": 1843
  },
  "model": {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "name": "supportbot"
  }
}
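For a non-streaming call from Python, the following sketch builds the same request as the curl example and reads the fields shown in the response document above. It uses only the standard library; the helper names build_request, summarize, and complete are hypothetical, and the 60-second timeout is an arbitrary choice rather than a documented limit.

```python
import json
import os
import urllib.request

API_URL = "https://www.soramai.com/api/v1/inference"


def build_request(prompt, api_key, **params):
    """Build the POST request; extra keyword args (temperature, max_new_tokens, ...) go into the JSON body."""
    body = {"prompt": prompt, **params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize(doc):
    """Pull the fields shown in the non-streaming response document into a flat dict."""
    return {
        "text": doc["response"],
        "total_tokens": doc["input_tokens"] + doc["output_tokens"],
        "coins_used": doc["usage"]["coins_used"],
        "latency_ms": doc["usage"]["latency_ms"],
        "adapter": doc["model"]["name"],
    }


def complete(prompt, **params):
    """Send one request and return the summarized result (performs a network call)."""
    req = build_request(prompt, os.environ["SORAMAI_API_KEY"], **params)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=60) as resp:
        return summarize(json.load(resp))
```

A call such as complete("Summarize this changelog entry.", temperature=0.7, max_new_tokens=256) would then return the response text alongside token and coin usage.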

Image generation

Deploy a fine-tuned FLUX adapter, then generate images over HTTPS. Authentication uses the same Bearer key as text completions, and the deployment is identified by your API key, so the URL is the same for every image deployment under your account.

curl https://www.soramai.com/api/v1/images \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a portrait of mytoken, studio lighting, 85mm",
    "negative_prompt": "blurry, low quality",
    "width": 1024,
    "height": 1024,
    "steps": 28,
    "guidance_scale": 3.5
  }'

# Response:
{
  "id": "5f2c8e7a-...",
  "image_url": "https://<signed-url, 1h TTL>",
  "created_at": "2026-06-07T19:24:11.000Z",
  "model": "black-forest-labs/FLUX.1-dev",
  "params": { "width": 1024, "height": 1024, "steps": 28, "guidance_scale": 3.5, "seed": 781234 },
  "usage": { "gpu_seconds": 6.2, "coins": 37, "cold_start": false }
}

Include your fine-tune's trigger word in the prompt. Each call returns one image and is billed per image; the worker scales to zero when idle. You can also generate interactively in the playground.
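Since the signed image_url in the example response is short-lived, a client will usually want to download the image right after generating it. The sketch below shows one way to do that with the standard library. The helper names image_payload, url_expiry, and generate are hypothetical, the trigger-word check is a client-side convenience rather than an API rule, and the expiry estimate assumes the 1-hour TTL is counted from created_at.

```python
import json
import os
import urllib.request
from datetime import datetime, timedelta

IMAGES_URL = "https://www.soramai.com/api/v1/images"


def image_payload(prompt, trigger_word, width=1024, height=1024, steps=28, guidance_scale=3.5):
    """Build the JSON body, refusing prompts that omit the fine-tune's trigger word."""
    if trigger_word.lower() not in prompt.lower():
        raise ValueError(f"prompt should contain the trigger word {trigger_word!r}")
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "steps": steps,
        "guidance_scale": guidance_scale,
    }


def url_expiry(doc, ttl=timedelta(hours=1)):
    """Estimate when the signed URL stops working (assumes the TTL starts at created_at)."""
    created = datetime.fromisoformat(doc["created_at"].replace("Z", "+00:00"))
    return created + ttl


def generate(prompt, trigger_word, out_path):
    """Generate one image and save it locally right away (performs network calls)."""
    req = urllib.request.Request(
        IMAGES_URL,
        data=json.dumps(image_payload(prompt, trigger_word)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # Generous timeout in case the worker is starting from zero (arbitrary value).
    with urllib.request.urlopen(req, timeout=300) as resp:
        doc = json.load(resp)
    # Fetch the signed URL immediately, before it expires.
    with urllib.request.urlopen(doc["image_url"], timeout=60) as img, open(out_path, "wb") as f:
        f.write(img.read())
    return doc
```

Storing the downloaded file, rather than the signed URL, avoids broken links once the URL expires.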