Soramai · Docs

Inference & Deploy

Two surfaces for running a fine-tuned adapter: an in-dashboard Playground for testing, and a per-deployment HTTPS endpoint for production traffic. Neither keeps billing while idle: a Playground pod shuts down automatically after 10 minutes, and Deploy API workers stop when no traffic arrives.

Playground vs Deploy API — which to use

Same model, two access modes. Pick based on whether the traffic is yours or your customers'.

Aspect      | Playground                          | Deploy API
Best for    | Manual testing, smoke-checks, demos | Production traffic from your app
Auth        | Browser session                     | API key (sk-ai-...)
Billing     | Per-minute of warm pod time         | Per request (token-priced)
Cold start  | ~30 s on first message              | ~2 s typical (warm pool)
Autoscaling | One pod per session                 | 0 → N workers, automatic
Idle cost   | Auto-shutdown after 10 min          | $0 (workers stop when idle)
Rate limit  | None (one session is one pod)       | Configurable per deployment

Deploying an adapter

One click from the dashboard. The fine-tuned adapter is wrapped into an autoscaling endpoint.

  1. Open soramai.com/deployments and click New deployment.
  2. Pick a fine-tuned adapter from your models list.
  3. Name the deployment (for your reference only; the name is not surfaced to API consumers). Confirm.
  4. Soramai provisions a serverless inference endpoint. You’ll get back a URL and an API key. The key is shown once, so copy it immediately.

Image deployments are currently routed to the in-app playground only. A dedicated image inference API is on the roadmap.

Making a request

HTTP POST with a Bearer token. The response is JSON by default, or an SSE stream if you set stream to true.

curl https://soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarise this changelog entry.",
    "system_prompt": "You are a concise technical writer.",
    "temperature": 0.7,
    "max_new_tokens": 256
  }'

Required

  • prompt · string

Optional

  • system_prompt · string
  • temperature · 0.0–1.5 · default 0.7
  • max_new_tokens · 1–4096 · default 256
  • stream · boolean · default false
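
The parameter ranges above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any Soramai SDK: build_payload is a hypothetical helper, and rejecting an empty prompt is an assumption (the docs only say that prompt is a required string).

```python
from typing import Any, Optional


def build_payload(
    prompt: str,
    system_prompt: Optional[str] = None,
    temperature: float = 0.7,
    max_new_tokens: int = 256,
    stream: bool = False,
) -> dict[str, Any]:
    """Build a request body, enforcing the documented parameter ranges."""
    # Assumption: an empty prompt is treated as missing.
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt is required and must be a non-empty string")
    if not 0.0 <= temperature <= 1.5:
        raise ValueError("temperature must be between 0.0 and 1.5")
    if not 1 <= max_new_tokens <= 4096:
        raise ValueError("max_new_tokens must be between 1 and 4096")

    payload: dict[str, Any] = {
        "prompt": prompt,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "stream": stream,
    }
    # system_prompt is optional, so it is only sent when provided.
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt
    return payload
```

The returned dict can be passed as the JSON body of the POST shown in the curl example, for instance as the json= argument of requests.post.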

Response shape

Default JSON response, or Server-Sent Events when streaming.

Non-streaming (default):

{
  "response": "Released 1.4 with streaming inference ...",
  "input_tokens": 124,
  "output_tokens": 87,
  "usage": {
    "coins_used": 31,
    "coins_per_minute": 200,
    "latency_ms": 1843
  },
  "model": {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "name": "supportbot"
  }
}
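
For typed access to these fields, the body can be unpacked into a small record. This is a sketch under the assumption that every field in the sample above is always present; InferenceResult and parse_result are hypothetical names, not part of the API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceResult:
    text: str
    input_tokens: int
    output_tokens: int
    coins_used: int
    latency_ms: int
    base_model: str
    model_name: str


def parse_result(body: str) -> InferenceResult:
    """Parse a non-streaming response body into an InferenceResult."""
    data = json.loads(body)
    # A missing field raises KeyError, which surfaces any mismatch with
    # the assumed response shape instead of hiding it.
    return InferenceResult(
        text=data["response"],
        input_tokens=data["input_tokens"],
        output_tokens=data["output_tokens"],
        coins_used=data["usage"]["coins_used"],
        latency_ms=data["usage"]["latency_ms"],
        base_model=data["model"]["base_model"],
        model_name=data["model"]["name"],
    )
```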

Streaming (when stream: true):

data: {"type":"chunk","text":"Released "}

data: {"type":"chunk","text":"1.4 "}

data: {"type":"chunk","text":"with "}

...

data: {"type":"done","input_tokens":12,"output_tokens":22,"usage":{"coins_used":18,"coins_per_minute":200,"latency_ms":1843}}

Frame types: chunk (incremental text the client appends), done (final usage + model info, stream ends), and error (fatal error mid-stream, stream ends with no done). Same auth, rate-limit, and billing semantics as the non-streaming endpoint.
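
Because an error frame ends the stream with no done frame, a consumer can treat any stream that closes without done as incomplete. The sketch below works on already-decoded lines, so it runs without a network connection. collect_stream and StreamError are hypothetical names, and reading the error message from a text field is an assumption; the frame list above does not document the error payload.

```python
import json
from typing import Any, Iterable, Tuple


class StreamError(RuntimeError):
    """Raised when a stream reports an error or ends without a done frame."""


def collect_stream(lines: Iterable[str]) -> Tuple[str, dict[str, Any]]:
    """Return (full_text, done_frame) from decoded SSE lines."""
    parts: list[str] = []
    for line in lines:
        # Blank separator lines and non-data fields carry no payload.
        if not line.startswith("data:"):
            continue
        evt = json.loads(line[len("data:"):].strip())
        kind = evt.get("type")
        if kind == "chunk":
            parts.append(evt["text"])
        elif kind == "done":
            return "".join(parts), evt
        elif kind == "error":
            # Assumption: the error message is carried in "text".
            raise StreamError(evt.get("text", "unknown stream error"))
    # The connection closed before a done frame arrived.
    raise StreamError("stream ended without a done frame")
```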

Streaming examples

Complete consumers in JavaScript and Python.

JavaScript / TypeScript:

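// Request a streamed completion; the API key is read from the environment.
const r = await fetch("https://soramai.com/api/v1/inference", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SORAMAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ prompt: "Hello", stream: true }),
});
// Fail fast on auth, rate-limit, or server errors before reading the stream.
if (!r.ok || !r.body) throw new Error(`Request failed: HTTP ${r.status}`);

const reader = r.body.getReader();
const decoder = new TextDecoder();
let buf = "";
let output = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // SSE frames are separated by a blank line; keep any partial frame in buf.
  const frames = buf.split("\n\n");
  buf = frames.pop() ?? "";
  for (const frame of frames) {
    const data = frame.replace(/^data:\s*/, "").trim();
    if (!data) continue;
    const evt = JSON.parse(data);
    if (evt.type === "chunk") output += evt.text;
    if (evt.type === "done") console.log("usage:", evt.usage);
    // An error frame ends the stream with no done frame. Reading the
    // message from evt.text mirrors the Python example.
    if (evt.type === "error") throw new Error(evt.text);
  }
}
console.log(output);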

Python:

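import json
import os
import requests

with requests.post(
    "https://soramai.com/api/v1/inference",
    headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
    json={"prompt": "Hello", "stream": True},
    stream=True,
) as r:
    # Surface auth, rate-limit, and server errors before reading the stream.
    r.raise_for_status()
    # SSE payloads are UTF-8. Without this, requests may fall back to
    # ISO-8859-1 for text/* responses and garble non-ASCII text.
    r.encoding = "utf-8"
    output = ""
    for raw in r.iter_lines(decode_unicode=True):
        # Skip blank separator lines and anything that is not a data field.
        if not raw or not raw.startswith("data:"):
            continue
        evt = json.loads(raw[len("data:"):].strip())
        if evt["type"] == "chunk":
            output += evt["text"]
        elif evt["type"] == "done":
            print(f"[usage] {evt['usage']}")
        elif evt["type"] == "error":
            # An error frame ends the stream; no done frame follows.
            raise RuntimeError(evt["text"])
print(output)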

API keys

One key per deployment. Scoped to the owning account. Rotatable from the dashboard.

  • Format. Keys start with sk-ai- followed by a 32-character random string. Treat as a secret — never check into version control.
  • Shown once. On creation, the full key appears in the dashboard exactly one time. After that the dashboard shows only the preview (sk-ai-abc1...). If you lose a key, rotate it.
  • Rotation. Click Rotate on the deployment’s page. The old key is revoked immediately. Make sure your app has the new key before clicking.
  • Scope. Each key is tied to exactly one deployment. Cross-deployment use is impossible.
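
A client can catch a truncated or mis-pasted key before sending a request by checking the format described above. This is a sketch: the helper names are hypothetical, the check covers only the prefix and length (the character set of the random part is not documented), and the 4-character preview length is inferred from the sk-ai-abc1... example.

```python
import os

KEY_PREFIX = "sk-ai-"
KEY_BODY_LENGTH = 32  # documented length of the random part


def looks_like_soramai_key(key: str) -> bool:
    """Check prefix and length only; this does not prove the key is valid."""
    body = key[len(KEY_PREFIX):]
    return (
        key.startswith(KEY_PREFIX)
        and len(body) == KEY_BODY_LENGTH
        and not any(ch.isspace() for ch in body)
    )


def load_api_key(env_var: str = "SORAMAI_API_KEY") -> str:
    """Read the key from the environment so it never lives in source code."""
    key = os.environ.get(env_var, "").strip()
    if not looks_like_soramai_key(key):
        raise RuntimeError(f"{env_var} is missing or not in sk-ai- format")
    return key


def key_preview(key: str) -> str:
    """Return a log-safe preview in the style the dashboard shows."""
    return key[: len(KEY_PREFIX) + 4] + "..."
```

Logging key_preview(key) instead of the key itself keeps the secret out of log files.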

Rate limits

Two tiers of guard rails per deployment, configurable from the deployment settings page.

  • Per-minute throttle. Default 60 requests / minute. Bursts above this return 429 Too Many Requests with a Retry-After header.
  • Daily coin cap. Default 10,000 coins / day. Prevents runaway spend on a misconfigured client. Requests beyond the cap also return 429.
  • Both limits are per-deployment, not per-account. Raise or lower them on the deployment settings page.
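
A client can honor the per-minute throttle by waiting for the Retry-After interval before retrying. The sketch below is transport-agnostic: send is a hypothetical callable that returns (status, headers, body), so any HTTP client can be wrapped. It assumes Retry-After carries a number of seconds and falls back to a fixed delay otherwise. The headers mapping is looked up case-sensitively here; the headers object from requests is case-insensitive.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as seconds, falling back to a default delay."""
    try:
        return max(0.0, float(headers.get("Retry-After", "")))
    except ValueError:
        # Header missing, or given in the HTTP-date form.
        return default


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(), waiting and retrying while the response is a 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        status, headers, _body = response
        # Return on success, on any non-429 error, or on the last attempt.
        if status != 429 or attempt == max_attempts:
            return response
        sleep(retry_after_seconds(headers))
    raise AssertionError("unreachable")
```

Retrying only helps with the per-minute throttle. A 429 caused by the daily coin cap will keep failing until the cap resets or is raised, which is why the number of attempts is bounded.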

Inference billing

Per-request, token-priced. No per-minute reservations, no warm-pool minimums.

  • Per-token rate. Each deployment shows its current rate in the dashboard, expressed as coins-per-minute of GPU time (which is then prorated by actual request duration).
  • No idle cost. Workers scale to 0 when no traffic arrives. You pay nothing for an idle deployment.
  • Wallet floor protection. If your wallet hits 0 during a streaming request, the in-flight response completes (you’re never cut off mid-token), but no new requests will start until you top up.
  • Refunds on infrastructure failure. If the inference pod errors before delivering tokens, the request is refunded in full and an audit row records the cause.