Soramai · Docs

Inference & Deploy.

Two surfaces for running a fine-tuned adapter: an in-dashboard Playground for testing, and a per-deployment HTTPS endpoint for production traffic. Both start on demand and stop billing once idle: a Playground pod shuts down after 10 idle minutes, and Deploy API workers scale to zero.


Playground vs Deploy API — which to use

Same model, two access modes. Pick based on whether the traffic is yours or your customers'.

Aspect	Playground	Deploy API
Best for	Manual testing, smoke-checks, demos	Production traffic from your app
Auth	Browser session	API key (sk-ai-...)
Billing	Per-minute of warm pod time	Per request (token-priced)
Cold start	~30 s on first message	~2 s typical (warm pool)
Autoscaling	One pod per session	0 → N workers, automatic
Idle cost	Auto-shutdown after 10 min	$0 — workers stop when idle
Rate limit	None (one session is one pod)	Configurable per deployment

Deploying an adapter

Deploy from the dashboard in a few clicks. Soramai wraps the fine-tuned adapter in an autoscaling endpoint.

1. Open soramai.com/deployments and click New deployment.
2. Pick a fine-tuned adapter from your models list.
3. Name the deployment (for your reference only; the name is not surfaced to API consumers). Confirm.
4. Soramai provisions a serverless inference endpoint. You’ll get back a URL and an API key. The key is shown once, so copy it immediately.

Deploying an image (FLUX) adapter works the same way and exposes POST /api/v1/images — send a prompt and get a generated image back. See the Deploy API reference for the image request/response shape.

Making a request

Send an HTTP POST with a Bearer token. The response is JSON by default, or a Server-Sent Events (SSE) stream if you set stream: true.

curl https://www.soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarise this changelog entry.",
    "system_prompt": "You are a concise technical writer.",
    "temperature": 0.7,
    "max_new_tokens": 256
  }'

Required

prompt · string

Optional

system_prompt · string
temperature · 0.0–1.5 · default 0.7
max_new_tokens · 1–4096 · default 256
stream · boolean · default false
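
The same request can be sketched in Python. This is a minimal sketch under stated assumptions: `build_payload` and `run_inference` are hypothetical helper names, not part of any Soramai SDK, and the range checks simply mirror the limits listed above so that out-of-range values fail locally instead of costing a round trip. It uses only the standard library and reads `SORAMAI_API_KEY` from the environment, as the curl example does.

```python
import json
import os
import urllib.request

API_URL = "https://www.soramai.com/api/v1/inference"


def build_payload(prompt, system_prompt=None, temperature=0.7,
                  max_new_tokens=256, stream=False):
    """Build the JSON body, enforcing the documented ranges client-side."""
    if not isinstance(prompt, str):
        raise ValueError("prompt is required and must be a string")
    if not 0.0 <= temperature <= 1.5:
        raise ValueError("temperature must be between 0.0 and 1.5")
    if not 1 <= max_new_tokens <= 4096:
        raise ValueError("max_new_tokens must be between 1 and 4096")
    payload = {
        "prompt": prompt,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "stream": stream,
    }
    # system_prompt is optional, so only send it when the caller set one.
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt
    return payload


def run_inference(payload, api_key=None):
    """POST a non-streaming request and return the decoded JSON response."""
    api_key = api_key or os.environ["SORAMAI_API_KEY"]
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

A call would look like `run_inference(build_payload("Summarise this changelog entry.", max_new_tokens=128))`. Keep `stream` at its default here, since `run_inference` expects a single JSON body rather than SSE frames.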

Response shape

Default JSON response, or Server-Sent Events when streaming.

Non-streaming (default):

{
  "response": "Released 1.4 with streaming inference ...",
  "input_tokens": 124,
  "output_tokens": 87,
  "usage": {
    "coins_used": 31,
    "coins_per_minute": 200,
    "latency_ms": 1843
  },
  "model": {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "name": "supportbot"
  }
}
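
To avoid scattering dictionary lookups through application code, the response above can be unpacked into a typed object. The following is a minimal sketch: `InferenceResult` and `parse_result` are hypothetical names chosen for illustration, and the fields are exactly those shown in the sample response, with no others assumed.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceResult:
    text: str
    input_tokens: int
    output_tokens: int
    coins_used: int
    coins_per_minute: int
    latency_ms: int
    base_model: str
    model_name: str


def parse_result(body: str) -> InferenceResult:
    """Parse a non-streaming response body into an InferenceResult."""
    data = json.loads(body)
    usage = data["usage"]
    model = data["model"]
    return InferenceResult(
        text=data["response"],
        input_tokens=data["input_tokens"],
        output_tokens=data["output_tokens"],
        coins_used=usage["coins_used"],
        coins_per_minute=usage["coins_per_minute"],
        latency_ms=usage["latency_ms"],
        base_model=model["base_model"],
        model_name=model["name"],
    )
```

A missing field raises `KeyError`, which is usually preferable to silently continuing with partial usage data.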

Streaming (when stream: true):

data: {"type":"chunk","text":"Released "}

data: {"type":"chunk","text":"1.4 "}

data: {"type":"chunk","text":"with "}

...

data: {"type":"done","input_tokens":12,"output_tokens":22,"usage":{"coins_used":18,"coins_per_minute":200,"latency_ms":1843}}

Frame types: chunk (incremental text the client appends), done (final usage + model info, stream ends), and error (fatal error mid-stream, stream ends with no done). Same auth, rate-limit, and billing semantics as the non-streaming endpoint.
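
The frame semantics above can be sketched as a small, transport-independent function that folds a sequence of SSE lines into the final text. This is an illustrative sketch, not Soramai's client code: `collect_stream` and `StreamError` are hypothetical names, and because the payload of an error frame is not specified here, the whole frame is included in the exception message. A stream that ends without a done frame is treated as a failure.

```python
import json


class StreamError(RuntimeError):
    """Raised when the stream reports an error or ends without a done frame."""


def collect_stream(lines):
    """Fold an iterable of SSE lines into (full_text, done_frame)."""
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        evt = json.loads(line[len("data:"):].strip())
        kind = evt.get("type")
        if kind == "chunk":
            parts.append(evt["text"])
        elif kind == "done":
            # done carries the final usage; nothing follows it.
            return "".join(parts), evt
        elif kind == "error":
            raise StreamError(f"server reported an error: {evt}")
    # The loop finished without done, so the output may be truncated.
    raise StreamError("stream ended without a done frame")
```

Because the function takes any iterable of strings, it can be fed from a live HTTP response or from recorded frames in a unit test.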

Streaming examples

Complete consumers in JavaScript and Python.

JavaScript / TypeScript:

const r = await fetch("https://www.soramai.com/api/v1/inference", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SORAMAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ prompt: "Hello", stream: true }),
});
// Fail fast on auth or rate-limit errors (e.g. 429) before reading the stream.
if (!r.ok || !r.body) throw new Error(`Request failed: HTTP ${r.status}`);

const reader = r.body.getReader();
const decoder = new TextDecoder();
let buf = "";
let output = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // Frames are separated by a blank line; keep any partial frame in buf.
  const frames = buf.split("\n\n");
  buf = frames.pop() ?? "";
  for (const frame of frames) {
    const data = frame.replace(/^data:\s*/, "").trim();
    if (!data) continue;
    const evt = JSON.parse(data);
    if (evt.type === "chunk") output += evt.text;
    if (evt.type === "done") console.log("usage:", evt.usage);
    // An error frame ends the stream with no done frame.
    if (evt.type === "error") throw new Error(`Stream error: ${data}`);
  }
}
console.log(output);

Python:

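import json
import os
import requests

with requests.post(
    "https://www.soramai.com/api/v1/inference",
    headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
    json={"prompt": "Hello", "stream": True},
    stream=True,
) as r:
    # Surface HTTP errors such as 401 or 429 before reading the stream.
    r.raise_for_status()
    # Decode frames as UTF-8; without an explicit charset, requests may
    # fall back to ISO-8859-1 for text/* bodies and garble non-ASCII text.
    r.encoding = "utf-8"
    output = ""
    for raw in r.iter_lines(decode_unicode=True):
        # Skip blank separator lines and anything that is not a data line.
        if not raw or not raw.startswith("data:"):
            continue
        evt = json.loads(raw[len("data:"):].strip())
        if evt["type"] == "chunk":
            output += evt["text"]
        elif evt["type"] == "done":
            print(f"[usage] {evt['usage']}")
        elif evt["type"] == "error":
            # An error frame ends the stream with no done frame.
            raise RuntimeError(evt["text"])
print(output)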

API keys

One key per deployment. Scoped to the owning account. Rotatable from the dashboard.

Format. Keys start with sk-ai- followed by a 32-character random string. Treat as a secret — never check into version control.
Shown once. On creation, the full key appears in the dashboard exactly one time. After that the dashboard shows only the preview (sk-ai-abc1...). If you lose a key, rotate it.
Rotation. Click Rotate on the deployment’s page. The old key is revoked immediately, so be ready to update your app with the new key as soon as you rotate.
Scope. Each key is tied to exactly one deployment. Cross-deployment use is impossible.
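
The key rules above lend themselves to a few client-side guards. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the format check only tests the `sk-ai-` prefix and the 32-character length because the character set is not specified here, and the preview mirrors the dashboard's `sk-ai-abc1...` style so that logs never contain the full secret.

```python
import os

KEY_PREFIX = "sk-ai-"
KEY_BODY_LENGTH = 32  # length of the random part, per the format rule above


def looks_like_api_key(value: str) -> bool:
    """Cheap shape check; it cannot tell whether the key is still valid."""
    return (
        value.startswith(KEY_PREFIX)
        and len(value) == len(KEY_PREFIX) + KEY_BODY_LENGTH
    )


def key_preview(value: str) -> str:
    """Return a log-safe preview: the prefix plus the first four characters."""
    return value[: len(KEY_PREFIX) + 4] + "..."


def load_api_key(env_var: str = "SORAMAI_API_KEY") -> str:
    """Read the key from the environment instead of hard-coding it."""
    value = os.environ.get(env_var, "")
    if not looks_like_api_key(value):
        raise RuntimeError(
            f"{env_var} is missing or not in the expected {KEY_PREFIX} format"
        )
    return value
```

Reading the key from the environment keeps it out of version control, and failing at startup on a malformed value is easier to diagnose than a rejected request later.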

Rate limits

Two tiers of guard rails per deployment, configurable from the deployment settings page.

Per-minute throttle. Default 60 requests / minute. Bursts above this return 429 Too Many Requests with a Retry-After header.
Daily coin cap. Default 10,000 coins / day. Prevents runaway spend on a misconfigured client. Requests beyond the cap also return 429.
Both limits are per-deployment, not per-account.
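
A client can respect the per-minute throttle by waiting for the interval given in Retry-After before trying again. The following is a hedged sketch, not an official client: `retry_delay` and `call_with_retry` are hypothetical names, `send` is any caller-supplied function that returns an object with `status_code` and `headers` (for example a `requests` response), and only the numeric-seconds form of Retry-After is parsed. Attempts are bounded because the daily coin cap also returns 429, and retrying cannot clear it.

```python
import time


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait before the next attempt after a 429."""
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return max(float(value), 0.0)  # honor the server's hint
        except ValueError:
            pass  # not a number (e.g. an HTTP-date); use backoff below
    # No usable hint: exponential backoff of 1s, 2s, 4s, ... up to cap.
    return min(cap, base * 2 ** attempt)


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out."""
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(retry_delay(response.headers, attempt))
    return response  # still 429 after max_attempts; the caller decides
```

Passing `sleep` as a parameter keeps the function testable without real delays. If the final response is still a 429, checking the deployment's daily coin usage is a reasonable next step.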

Inference billing

Per-request, token-priced. No per-minute reservations, no warm-pool minimums.

Per-token rate. Each deployment shows its current rate in the dashboard, expressed as coins-per-minute of GPU time (which is then prorated by actual request duration).
No idle cost. Workers scale to 0 when no traffic arrives. You pay nothing for an idle deployment.
Wallet floor protection. If your wallet hits 0 during a streaming request, the in-flight response completes (you’re never cut off mid-token), but no new requests will start until you top up.
Refunds on infrastructure failure. If the inference pod errors before delivering tokens, the request is refunded in full and an audit row records the cause.
