Soramai · Docs
Inference & Deploy.
Two surfaces for running a fine-tuned adapter: an in-dashboard Playground for testing, and a per-deployment HTTPS endpoint for production traffic. Both are autoscaling and billed only when serving.
Playground vs Deploy API — which to use
Same model, two access modes. Pick based on whether the traffic is yours or your customers'.
| Aspect | Playground | Deploy API |
|---|---|---|
| Best for | Manual testing, smoke-checks, demos | Production traffic from your app |
| Auth | Browser session | API key (sk-ai-...) |
| Billing | Per-minute of warm pod time | Per request (token-priced) |
| Cold start | ~30 s on first message | ~2 s typical (warm pool) |
| Autoscaling | One pod per session | 0 → N workers, automatic |
| Idle cost | Auto-shutdown after 10 min | $0 — workers stop when idle |
| Rate limit | None (one session is one pod) | Configurable per deployment |
Deploying an adapter
One click from the dashboard. The fine-tuned adapter is wrapped into an autoscaling endpoint.
1. Open soramai.com/deployments and click New deployment.
2. Pick a fine-tuned adapter from your models list.
3. Name the deployment (for your reference; not surfaced to API consumers), then confirm.
4. Soramai provisions a serverless inference endpoint and returns a URL and an API key. The key is shown once, so copy it immediately.
Image deployments are currently routed to the in-app playground only. A dedicated image inference API is on the roadmap.
Making a request
HTTP POST with a Bearer token. The response is JSON by default, or an SSE stream when the request sets stream: true.
```bash
curl https://soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarise this changelog entry.",
    "system_prompt": "You are a concise technical writer.",
    "temperature": 0.7,
    "max_new_tokens": 256
  }'
```
Required
- prompt · string

Optional
- system_prompt · string
- temperature · 0.0–1.5 · default 0.7
- max_new_tokens · 1–4096 · default 256
- stream · boolean · default false
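The parameter ranges above can be checked client-side before a request is sent. A minimal Python sketch, assuming the documented defaults and ranges; `build_payload` is a hypothetical helper, not part of any Soramai SDK:

```python
from typing import Any, Optional


def build_payload(
    prompt: str,
    system_prompt: Optional[str] = None,
    temperature: float = 0.7,
    max_new_tokens: int = 256,
    stream: bool = False,
) -> dict[str, Any]:
    """Build a request body, enforcing the documented ranges locally."""
    if not prompt:
        raise ValueError("prompt is required")
    if not 0.0 <= temperature <= 1.5:
        raise ValueError("temperature must be within 0.0-1.5")
    if not 1 <= max_new_tokens <= 4096:
        raise ValueError("max_new_tokens must be within 1-4096")
    payload: dict[str, Any] = {
        "prompt": prompt,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "stream": stream,
    }
    # system_prompt is optional, so it is only sent when provided.
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt
    return payload
```

The returned dict can be passed as the JSON body of the POST shown above. Validating locally avoids spending a request on a value the server would presumably reject.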
Response shape
Default JSON response, or Server-Sent Events when streaming.
Non-streaming (default):
```json
{
  "response": "Released 1.4 with streaming inference ...",
  "input_tokens": 124,
  "output_tokens": 87,
  "usage": {
    "coins_used": 31,
    "coins_per_minute": 200,
    "latency_ms": 1843
  },
  "model": {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "name": "supportbot"
  }
}
```
Streaming (when stream: true):
```
data: {"type":"chunk","text":"Released "}
data: {"type":"chunk","text":"1.4 "}
data: {"type":"chunk","text":"with "}
...
data: {"type":"done","input_tokens":12,"output_tokens":22,"usage":{"coins_used":18,"coins_per_minute":200,"latency_ms":1843}}
```
Frame types: chunk (incremental text the client appends), done (final usage + model info, stream ends), and error (fatal error mid-stream, stream ends with no done). Same auth, rate-limit, and billing semantics as the non-streaming endpoint.
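The three frame types reduce to a small state machine that can be tested without a network connection. A minimal Python sketch, assuming each frame arrives as a single `data:` line as in the sample above; `collect_stream` is a hypothetical helper, and because the fields of an error frame are not specified here, the whole frame is included in the exception message:

```python
import json
from typing import Any, Iterable


def collect_stream(lines: Iterable[str]) -> tuple[str, dict[str, Any]]:
    """Reduce SSE lines to (full_text, done_frame).

    Raises RuntimeError on an error frame, or if the stream ends
    without a done frame.
    """
    parts: list[str] = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        evt = json.loads(line[len("data:"):].strip())
        kind = evt.get("type")
        if kind == "chunk":
            parts.append(evt["text"])
        elif kind == "done":
            return "".join(parts), evt
        elif kind == "error":
            raise RuntimeError(f"stream failed: {evt}")
    # No done frame was seen, so the stream was cut short.
    raise RuntimeError("stream ended without a done frame")
```

Treating a missing done frame as a failure means a dropped connection is not mistaken for a short but complete response.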
Streaming examples
Complete consumers in JavaScript and Python.
JavaScript / TypeScript:
```javascript
const r = await fetch("https://soramai.com/api/v1/inference", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SORAMAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ prompt: "Hello", stream: true }),
});
// Fail early on HTTP errors (e.g. 401, 429) instead of parsing an error body as SSE.
if (!r.ok || !r.body) throw new Error(`HTTP ${r.status}`);

const reader = r.body.getReader();
const decoder = new TextDecoder();
let buf = "";
let output = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // Frames are separated by a blank line; keep any trailing partial frame in buf.
  const frames = buf.split("\n\n");
  buf = frames.pop() ?? "";
  for (const frame of frames) {
    const data = frame.replace(/^data:\s*/, "").trim();
    if (!data) continue;
    const evt = JSON.parse(data);
    if (evt.type === "chunk") output += evt.text;
    if (evt.type === "done") console.log("usage:", evt.usage);
    // An error frame ends the stream with no done frame.
    if (evt.type === "error") throw new Error(evt.text ?? "stream error");
  }
}
console.log(output);
```
Python:
```python
import json
import os

import requests

with requests.post(
    "https://soramai.com/api/v1/inference",
    headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
    json={"prompt": "Hello", "stream": True},
    stream=True,
) as r:
    r.raise_for_status()  # surface 401/429 etc. before reading the stream
    # SSE is UTF-8; without this, requests may fall back to ISO-8859-1 for text/* responses.
    r.encoding = "utf-8"
    output = ""
    for raw in r.iter_lines(decode_unicode=True):
        if not raw or not raw.startswith("data:"):
            continue
        evt = json.loads(raw[len("data:"):].strip())
        if evt["type"] == "chunk":
            output += evt["text"]
        elif evt["type"] == "done":
            print(f"[usage] {evt['usage']}")
        elif evt["type"] == "error":
            raise RuntimeError(evt["text"])
print(output)
```
API keys
One key per deployment. Scoped to the owning account. Rotatable from the dashboard.
- Format. Keys start with sk-ai- followed by a 32-character random string. Treat as a secret; never check into version control.
- Shown once. On creation, the full key appears in the dashboard exactly one time. After that the dashboard shows only the preview (sk-ai-abc1...). If you lose a key, rotate it.
- Rotation. Click Rotate on the deployment’s page. The old key is revoked immediately, so make sure your app is ready to switch to the new key before clicking.
- Scope. Each key is tied to exactly one deployment. Cross-deployment use is impossible.
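A client can sanity-check a key's shape at startup and avoid writing the full secret to logs. A minimal Python sketch, assuming only the prefix and length described above (the character set of the random part is not specified, so it is not checked); the helper names and the `SORAMAI_API_KEY` variable follow the examples in this page but are otherwise hypothetical:

```python
import os

KEY_PREFIX = "sk-ai-"
KEY_BODY_LENGTH = 32  # length of the random part, per the format above


def looks_like_key(key: str) -> bool:
    """Check prefix and total length only; this does not prove the key is valid."""
    return key.startswith(KEY_PREFIX) and len(key) == len(KEY_PREFIX) + KEY_BODY_LENGTH


def preview(key: str) -> str:
    """Return a log-safe preview in the dashboard's style, e.g. 'sk-ai-abc1...'."""
    return key[: len(KEY_PREFIX) + 4] + "..."


def load_key(env_var: str = "SORAMAI_API_KEY") -> str:
    """Read the key from the environment rather than from source code."""
    key = os.environ.get(env_var, "")
    if not looks_like_key(key):
        raise RuntimeError(f"{env_var} is missing or malformed")
    return key
```

Logging `preview(key)` instead of the key itself keeps enough context to tell deployments apart without exposing the secret.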
Rate limits
Two tiers of guard rails per deployment, configurable from the deployment settings page.
- Per-minute throttle. Default 60 requests / minute. Bursts above this return 429 Too Many Requests with a Retry-After header.
- Daily coin cap. Default 10,000 coins / day. Prevents runaway spend on a misconfigured client. Requests beyond the cap also return 429.
- Both limits are per-deployment, not per-account. Raise or lower them on the deployment settings page.
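One way a client might respond to a 429 is to wait for the interval given in Retry-After and try again a bounded number of times. A minimal Python sketch, assuming Retry-After carries a number of seconds; `send` is a hypothetical callable that performs one request and returns `(status, headers, body)`, so the retry logic stays independent of the HTTP library:

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def retry_delay(headers: Mapping[str, str], default: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait, taken from a numeric Retry-After header when present."""
    raw = headers.get("Retry-After", "")  # a plain dict lookup is case-sensitive
    try:
        return min(cap, max(0.0, float(raw)))
    except ValueError:
        return default  # header missing, or an HTTP-date this sketch does not parse


def send_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response[0] != 429 or attempt == max_attempts:
            return response
        sleep(retry_delay(response[1]))
    raise AssertionError("unreachable")
```

The attempt limit matters because a 429 caused by the daily coin cap is unlikely to clear after a short wait, so the last 429 is returned to the caller rather than retried indefinitely.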
Inference billing
Per-request, token-priced. No per-minute reservations, no warm-pool minimums.
- Per-token rate. Each deployment shows its current rate in the dashboard, expressed as coins-per-minute of GPU time (which is then prorated by actual request duration).
- No idle cost. Workers scale to 0 when no traffic arrives. You pay nothing for an idle deployment.
- Wallet floor protection. If your wallet hits 0 during a streaming request, the in-flight response completes (you’re never cut off mid-token), but no new requests will start until you top up.
- Refunds on infrastructure failure. If the inference pod errors before delivering tokens, the request is refunded in full and an audit row records the cause.
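Because every response reports usage.coins_used, a client can keep its own running total and stop sending before a self-imposed budget is exhausted. A minimal Python sketch, assuming the usage object shown in the response shape; `SpendTracker` is a hypothetical client-side helper and does not replace the server-side limits or wallet checks:

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class SpendTracker:
    """Client-side tally of the coins reported in responses."""

    budget: int
    spent: int = 0

    def record(self, response: Mapping[str, Any]) -> None:
        # Works for a non-streaming response body or a streaming done frame,
        # since both carry usage.coins_used in the examples above.
        self.spent += response["usage"]["coins_used"]

    @property
    def remaining(self) -> int:
        return max(0, self.budget - self.spent)

    def can_send(self) -> bool:
        return self.spent < self.budget
```

Calling `record` after each response and checking `can_send` before the next request gives an early warning of unexpected spend.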