Deploy API

Promote a fine-tuned adapter to a live endpoint in one click.

Every fine-tuned adapter on Soramai can be promoted to an autoscaling inference endpoint. Call it from the dashboard playground, the CLI, or directly over HTTPS. Endpoints are billed per request and cost nothing while idle.

What you get

Soramai endpoints are managed: scaling, queueing, retries, observability, and key management are all built in.

Autoscaling endpoints

Each deployed adapter gets its own HTTPS endpoint that scales from zero to as many workers as you need. Idle endpoints cost nothing.

Streaming responses

Text completions stream tokens over HTTP Server-Sent Events (SSE). The schema mirrors the one used by common OpenAI-compatible clients.

Per-request billing

You pay for the GPU time each request actually consumes — no per-minute reservations and no warm-pool minimums.

Scoped API keys

Create, rotate, and revoke keys from the dashboard. Keys are scoped to a single account and shown once at creation time.

Region pinning

Pin a deployment to a region for data-residency requirements. Multi-region active-active is available on enterprise plans.

Logs and metrics

Per-request latency, queue depth, and error rate are exposed in the dashboard. Logs are retained for 90 days.

Text completions — request

Send a prompt to a fine-tuned adapter. The deployment is identified by the API key, so the URL is the same for every deployment under your account.

curl https://soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize this changelog entry.",
    "system_prompt": "You are a concise technical writer.",
    "temperature": 0.7,
    "max_new_tokens": 256
  }'
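The same request can be sent from Python. What follows is a minimal sketch using the requests library, not an official client: `build_payload` and `complete` are hypothetical helper names, the 60-second timeout is an arbitrary choice, and the only body fields used are the ones shown in the curl example. It assumes that options left out of the body fall back to server defaults, which this page does not spell out.

```python
import os
from typing import Any, Optional

API_URL = "https://soramai.com/api/v1/inference"


def build_payload(
    prompt: str,
    system_prompt: Optional[str] = None,
    temperature: Optional[float] = None,
    max_new_tokens: Optional[int] = None,
) -> dict[str, Any]:
    """Build the JSON body, leaving out any option the caller did not set."""
    payload: dict[str, Any] = {"prompt": prompt}
    optional = {
        "system_prompt": system_prompt,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def complete(prompt: str, **options: Any) -> dict[str, Any]:
    """POST a non-streaming completion and return the decoded JSON body."""
    import requests  # third-party; imported here so build_payload has no dependencies

    r = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
        json=build_payload(prompt, **options),
        timeout=60,  # assumption: choose a timeout that suits your workload
    )
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json()
```

Calling `complete("Summarize this changelog entry.", temperature=0.7, max_new_tokens=256)` sends the same fields as the curl command, minus the system prompt.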

Streaming responses

Add stream: true to get tokens back as Server-Sent Events. Compatible with the Vercel AI SDK, openai-node, and any standard SSE client.

curl https://soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "prompt": "Write a haiku about graphics cards.",
    "stream": true
  }'

# Stream output (each block is one SSE event):
data: {"type":"chunk","text":"Fans spin at midnight—"}

data: {"type":"chunk","text":" silicon dreams of compute,"}

data: {"type":"chunk","text":" oceans of warm watts."}

data: {"type":"done","input_tokens":12,"output_tokens":22,"usage":{"coins_used":18,"coins_per_minute":200,"latency_ms":1843},"model":{"base_model":"Qwen/Qwen2.5-7B-Instruct","name":"haiku-bot"}}

Frame types: chunk (incremental text the client appends), done (final usage and model info, stream ends), and error (fatal error mid-stream, stream ends with no done). Authentication, rate limits, and billing work the same as on the non-streaming endpoint; streaming is purely a transport choice.
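Network reads do not line up with frame boundaries, so a client has to buffer bytes until a full event arrives. The sketch below shows one way to do that offline, with no HTTP involved. `FrameParser` and `collect` are hypothetical names, the sample bytes are shortened from the stream output above, and it assumes events are separated by a bare blank line (`\n\n`) as in that output. The fields of an error frame are not documented here, so the whole frame is put into the exception message.

```python
import json
from typing import Any, Iterable, Optional


class FrameParser:
    """Incrementally split a byte stream into decoded SSE frames."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[dict[str, Any]]:
        """Add raw bytes and return every frame that is now complete."""
        self._buf += chunk
        # Events end with a blank line; the last piece may still be partial.
        *complete, self._buf = self._buf.split(b"\n\n")
        frames = []
        for raw in complete:
            # Decode only whole events so multi-byte characters are never split.
            for line in raw.decode("utf-8").splitlines():
                if line.startswith("data:"):
                    frames.append(json.loads(line[len("data:"):].strip()))
        return frames


def collect(frames: Iterable[dict[str, Any]]) -> tuple[str, Optional[dict[str, Any]]]:
    """Fold frames into (full text, done frame); raise on an error frame."""
    text = ""
    for evt in frames:
        if evt["type"] == "chunk":
            text += evt["text"]
        elif evt["type"] == "done":
            return text, evt
        elif evt["type"] == "error":
            raise RuntimeError(f"stream failed: {evt}")
    return text, None  # stream ended without a done frame


wire = (
    b'data: {"type":"chunk","text":"Fans spin"}\n\n'
    b'data: {"type":"chunk","text":" at midnight"}\n\n'
    b'data: {"type":"done","input_tokens":12,"output_tokens":22}\n\n'
)
parser = FrameParser()
frames = []
# Feed 7-byte slices to simulate reads that cut frames at arbitrary points.
for i in range(0, len(wire), 7):
    frames.extend(parser.feed(wire[i:i + 7]))
text, done = collect(frames)
```

Buffering bytes instead of decoded text is a deliberate choice: a multi-byte UTF-8 character can straddle two reads, and decoding only complete events sidesteps that case.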

JavaScript / TypeScript example

A complete streaming consumer in about 30 lines. Pairs cleanly with React.useState updates for a typewriter UI.

const r = await fetch("https://soramai.com/api/v1/inference", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SORAMAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ prompt: "Hello", stream: true }),
});
// Fail fast on HTTP errors before trying to read the stream.
if (!r.ok || !r.body) throw new Error(`Request failed: HTTP ${r.status}`);

const reader = r.body.getReader();
const decoder = new TextDecoder();
let buf = "";
let output = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // SSE events are separated by a blank line; keep any partial event in buf.
  const frames = buf.split("\n\n");
  buf = frames.pop() ?? "";
  for (const frame of frames) {
    const data = frame.replace(/^data:\s*/, "").trim();
    if (!data) continue;
    const evt = JSON.parse(data);
    if (evt.type === "chunk") output += evt.text;
    if (evt.type === "done") console.log("usage:", evt.usage);
    // An error frame ends the stream, and no done frame follows it.
    if (evt.type === "error") throw new Error(`Stream failed: ${data}`);
  }
}
console.log("final:", output);

Python example

A streaming consumer using the third-party requests library. Use httpx if you need asyncio.

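import json
import os
import requests

with requests.post(
    "https://soramai.com/api/v1/inference",
    headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
    json={"prompt": "Hello", "stream": True},
    stream=True,
) as r:
    # Fail fast on HTTP errors before reading the stream.
    r.raise_for_status()
    # requests assumes ISO-8859-1 for text/* responses that declare no charset;
    # force UTF-8 so non-ASCII tokens decode correctly.
    r.encoding = "utf-8"
    output = ""
    for raw in r.iter_lines(decode_unicode=True):
        # Skip blank separator lines and anything that is not a data line.
        if not raw or not raw.startswith("data:"):
            continue
        evt = json.loads(raw[len("data:"):].strip())
        if evt["type"] == "chunk":
            output += evt["text"]
            print(evt["text"], end="", flush=True)
        elif evt["type"] == "done":
            print(f"\n\n[usage] {evt['usage']}")
        elif evt["type"] == "error":
            raise RuntimeError(evt["text"])
# output now holds the full completion text, already printed chunk by chunk.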
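For asyncio code, the same loop can be written with httpx. This is a sketch under a few assumptions rather than a reference client: `parse_line` and `stream_completion` are hypothetical names, `timeout=None` disables httpx timeouts and is only a placeholder for a value that fits your workload, and the fields of an error frame are not documented on this page, so the whole frame goes into the exception message.

```python
import json
import os
from typing import Any, Optional


def parse_line(line: str) -> Optional[dict[str, Any]]:
    """Decode one SSE line, or return None for blank and non-data lines."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())


async def stream_completion(prompt: str) -> str:
    """Stream a completion with httpx and return the concatenated text."""
    import httpx  # third-party: pip install httpx

    output = ""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "https://soramai.com/api/v1/inference",
            headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
            json={"prompt": prompt, "stream": True},
        ) as r:
            r.raise_for_status()
            async for line in r.aiter_lines():
                evt = parse_line(line)
                if evt is None:
                    continue
                if evt["type"] == "chunk":
                    output += evt["text"]
                elif evt["type"] == "error":
                    raise RuntimeError(f"stream failed: {evt}")
                elif evt["type"] == "done":
                    break  # final frame; the stream ends here
    return output
```

Run it from synchronous code with `asyncio.run(stream_completion("Hello"))`, or await it directly inside an existing event loop.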

Non-streaming response shape

When stream is omitted or false, the API returns a single JSON document with the full response. This is the lowest-effort integration if you don't need a typewriter UX.

{
  "response": "Soramai now supports per-second training billing...",
  "input_tokens": 124,
  "output_tokens": 87,
  "usage": {
    "coins_used": 31,
    "coins_per_minute": 200,
    "latency_ms": 1843
  },
  "model": {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "name": "supportbot"
  }
}
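If you would rather work with typed fields than raw dictionaries, the document above can be mapped onto a small dataclass. This is an illustrative sketch: `Completion` and `parse_completion` are hypothetical names, the integer types are inferred from the sample values, and only the fields shown in the sample are read.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Completion:
    response: str
    input_tokens: int
    output_tokens: int
    coins_used: int
    coins_per_minute: int
    latency_ms: int
    base_model: str
    adapter_name: str


def parse_completion(body: dict[str, Any]) -> Completion:
    """Flatten the nested usage and model objects into one record."""
    usage, model = body["usage"], body["model"]
    return Completion(
        response=body["response"],
        input_tokens=body["input_tokens"],
        output_tokens=body["output_tokens"],
        coins_used=usage["coins_used"],
        coins_per_minute=usage["coins_per_minute"],
        latency_ms=usage["latency_ms"],
        base_model=model["base_model"],
        adapter_name=model["name"],
    )


# The sample response from this page, used as offline test data.
sample = json.loads("""
{
  "response": "Soramai now supports per-second training billing...",
  "input_tokens": 124,
  "output_tokens": 87,
  "usage": {"coins_used": 31, "coins_per_minute": 200, "latency_ms": 1843},
  "model": {"base_model": "Qwen/Qwen2.5-7B-Instruct", "name": "supportbot"}
}
""")
result = parse_completion(sample)
```

A missing key raises KeyError, which makes a change in the response shape visible right away instead of passing None values further into your code.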

Image generation

Image LoRA inference is currently available through the in-app playground only. A dedicated image inference API is on the roadmap.

To generate images from a fine-tuned FLUX adapter, open the adapter in the playground. The Deploy API will route image-model endpoints once the worker contract is finalised; until then, deploy attempts on image adapters are rejected server-side.