Datasets

Every supervised fine-tune starts with a dataset. Soramai accepts JSONL for text fine-tuning and ZIP archives for image LoRA fine-tuning, and offers three different ways to produce one: bring your own, type by hand, or generate with AI.

Text dataset format (JSONL)

One JSON object per line. Each object represents a single fine-tuning example.

{"prompt": "Summarise this changelog: ...", "response": "1.4 adds streaming inference ..."}
{"prompt": "Classify ticket as bug/feature/question: ...", "response": "feature"}
{"prompt": "Translate to French: 'It is raining'", "response": "Il pleut"}

Required fields

  • prompt — the user input the model will see at inference time
  • response — the target output you want the model to learn

Optional fields

  • system_prompt — per-example system message
  • metadata — arbitrary object, preserved on the row but not used for fine-tuning
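
The format above can be produced with a few lines of Python. This is a minimal sketch, not a Soramai tool: the helper names `to_jsonl` and `write_jsonl` and the sample rows are illustrative, and only the four field names come from the lists above.

```python
import json

# Illustrative rows; the values are made up for the example.
EXAMPLE_ROWS = [
    {"prompt": "Translate to French: 'It is raining'", "response": "Il pleut"},
    {
        "prompt": "Classify ticket as bug/feature/question: ...",
        "response": "feature",
        "system_prompt": "You are a support triage assistant.",
        "metadata": {"source": "hand-written"},
    },
]

def to_jsonl(rows):
    """Serialise an iterable of dicts to JSONL text, one object per line."""
    # json.dumps (without indent) never emits a raw newline, so each row
    # stays on a single line. ensure_ascii=False keeps accented characters
    # readable in the UTF-8 output.
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)

def write_jsonl(rows, path):
    # encoding="utf-8" writes no BOM; newline="\n" forces LF endings on any OS.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(to_jsonl(rows))
```

Calling `write_jsonl(EXAMPLE_ROWS, "dataset.jsonl")` would write a two-line file in which the second row carries both optional fields.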

Limits

  • Maximum file size: 20 MB per dataset.
  • Maximum rows: 50,000 per single dataset.
  • Maximum sequence length: 2,048 tokens per row (longer rows are truncated to fit the base model context).
  • Encoding: UTF-8. No BOM. LF line endings (Windows CRLF is auto-normalised).
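
The file-level limits can be checked locally before uploading. The sketch below is an assumption-laden pre-flight check, not Soramai's validator: `check_file_limits` is a hypothetical helper, "20 MB" is read as decimal megabytes, and blank lines are not counted as rows. The token limit is not checked because it depends on the base model's tokeniser.

```python
import os

MAX_BYTES = 20 * 1000 * 1000  # assumption: "20 MB" read as decimal megabytes
MAX_ROWS = 50_000
UTF8_BOM = b"\xef\xbb\xbf"

def check_file_limits(path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes; the limit is {MAX_BYTES}")
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(UTF8_BOM):
        problems.append("file starts with a UTF-8 BOM")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8 at byte {exc.start}")
        return problems
    # Split on "\n" only; CRLF files still count correctly because the
    # trailing "\r" is whitespace and does not make a line non-blank.
    rows = sum(1 for line in text.split("\n") if line.strip())
    if rows > MAX_ROWS:
        problems.append(f"{rows} rows; the limit is {MAX_ROWS}")
    return problems
```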

Image dataset format (ZIP archive)

Used for fine-tuning image LoRAs on FLUX or SDXL base checkpoints. Upload a ZIP archive of image and caption pairs.

my-dataset.zip
├── 001.jpg
├── 001.txt          ← caption for 001.jpg
├── 002.png
├── 002.txt
├── 003.webp
├── 003.txt
└── ... up to 500 image + caption pairs

Image rules

  • Supported formats: JPG, PNG, WebP
  • Recommended size: 512×512 to 2048×2048
  • Aspect ratio: any (auto-cropped/padded during fine-tuning)
  • Maximum total ZIP size: 500 MB
  • Minimum images: 5; maximum: 500

Caption rules

  • One .txt per image, identical basename
  • Plain UTF-8 text
  • Booru-style tags work well: 1girl, blue eyes, smiling
  • Use a unique trigger word in every caption to call the LoRA at inference time
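
The pairing and count rules can be checked before upload with Python's standard `zipfile` module. This is a rough sketch under stated assumptions: `check_image_zip` is a hypothetical helper, only the three listed extensions are treated as images (whether `.jpeg` is accepted is not stated here), and the trigger-word test is a plain substring match. It does not open the images or check the 500 MB size limit.

```python
import zipfile
from pathlib import PurePosixPath

IMAGE_EXTS = {".jpg", ".png", ".webp"}  # the three formats listed above
MIN_IMAGES, MAX_IMAGES = 5, 500

def check_image_zip(path, trigger=None):
    """Return a list of problems with image/caption pairing in a ZIP."""
    problems = []
    images, captions = {}, {}
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # directory entry, not a file
            p = PurePosixPath(name)
            key = str(p.with_suffix(""))  # path without extension = basename key
            ext = p.suffix.lower()
            if ext in IMAGE_EXTS:
                images[key] = name
            elif ext == ".txt":
                captions[key] = zf.read(name).decode("utf-8", errors="replace")
    for key in sorted(images.keys() - captions.keys()):
        problems.append(f"{images[key]}: no caption file {key}.txt")
    for key in sorted(captions.keys() - images.keys()):
        problems.append(f"{key}.txt: no image with the same basename")
    if not MIN_IMAGES <= len(images) <= MAX_IMAGES:
        problems.append(f"{len(images)} images; expected {MIN_IMAGES} to {MAX_IMAGES}")
    if trigger:
        for key in sorted(captions):
            if trigger not in captions[key]:
                problems.append(f"{key}.txt: trigger word {trigger!r} not found")
    return problems
```

For example, `check_image_zip("my-dataset.zip", trigger="sks_style")` would list every image without a caption and every caption that omits the (made-up) trigger word `sks_style`.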

AI-assisted dataset generation

Describe what you want and Soramai produces the JSONL for you. Useful for prototypes, synthetic data augmentation, or when you don't have examples on hand.

  1. Open Dataset Studio, click Generate with AI.
  2. Describe the task in plain English: “Customer support replies to billing questions, polite tone, 2–3 sentences”.
  3. Pick a row count: 100, 500, or 1000. Larger counts cost more but yield more stable fine-tuning.
  4. Click Generate. Rows stream into the editor live; you can stop, edit, or restart at any point.
  5. When done, click Save dataset. The full JSONL is stored in your account and immediately usable by the fine-tuning page.

Costs

  • 100 rows: ~0.5 coins (about $0.005)
  • 500 rows: ~2.5 coins
  • 1000 rows: ~5 coins

Generation runs against a high-quality teacher model. Per-row cost scales with prompt complexity — Soramai shows the live estimate in the dashboard before you commit.
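
The listed prices work out to roughly 0.005 coins per row, and the first line implies about $0.01 per coin. As a back-of-the-envelope sketch only (the function name is hypothetical, both rates are derived from the approximate figures above, and the live estimate in the dashboard takes precedence):

```python
# Both rates are derived from the approximate list prices, not from an
# official rate card.
COINS_PER_ROW = 0.5 / 100    # "100 rows: ~0.5 coins"
USD_PER_COIN = 0.005 / 0.5   # "~0.5 coins (about $0.005)"

def estimate_generation_cost(rows):
    """Return (coins, usd) as a baseline estimate for a given row count."""
    coins = rows * COINS_PER_ROW
    return coins, coins * USD_PER_COIN
```

Because per-row cost scales with prompt complexity, treat the result as a baseline for simple prompts.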

Crash recovery

Generation jobs are persisted server-side. If you close the tab or your machine crashes mid-generation, opening the same dataset again shows a “Resume in-progress generation?” banner. The recovery window is 30 minutes from the time the job started.

Merging multiple datasets

Combine several smaller datasets into one fine-tuning run without writing a single line of code.

You may have generated five 1000-row datasets for the same task at different times. Rather than re-uploading them as one big file, select all of them in the fine-tuning page and Soramai concatenates them server-side into a single merged dataset before launching the pod.

  1. Open the fine-tuning page and click Pick from My Datasets.
  2. Multi-select up to 10 datasets. The footer shows the combined row count and total byte size.
  3. Click Use N selected. Soramai computes a SHA-256 of the dataset-id list, deduplicates against any previous merge of the same selection, and serves a single signed URL to the pod.
  4. The fine-tuning run proceeds normally: the worker sees one file, with all rows shuffled by the data loader.
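
To illustrate the idea behind step 3, here is one way a selection could be turned into a stable cache key and the files concatenated. This is a sketch of the general technique, not Soramai's implementation: the function names, the newline separator, and the choice to sort the ids (so that selection order does not matter) are all assumptions.

```python
import hashlib

def merge_key(dataset_ids):
    """Hash a selection of dataset ids into a hex SHA-256 digest."""
    # Assumption: ids are sorted first, so the same selection made in a
    # different order maps to the same key and can reuse an earlier merge.
    canonical = "\n".join(sorted(dataset_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def concat_jsonl(texts):
    """Concatenate JSONL documents, ensuring each one ends with a newline."""
    parts = []
    for text in texts:
        if text and not text.endswith("\n"):
            text += "\n"  # avoid gluing the last row of one file to the next
        parts.append(text)
    return "".join(parts)
```

A server using this scheme could look up `merge_key(ids)` in its store of merged files and only build a new file on a miss.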

Merge limits

  • Maximum datasets per merge: 10
  • Maximum merged file size: 200 MB
  • Merged files are temporary (7-day TTL); recreated on next fine-tuning launch if needed

Validation and errors

Soramai validates every dataset before queuing the pod. If validation fails, the run is rejected with a clear error and you are not charged.

  • Empty rows (missing prompt or response) are rejected with the exact line number that failed.
  • Invalid JSON on any line aborts the upload — the error names the offending line and the parser’s diagnosis.
  • Oversize rows (after tokenisation) are silently truncated to fit the context window. The fine-tuning log records the count of affected rows.
  • Duplicate detection is not enforced — duplicate rows are accepted and weighted equally during fine-tuning.
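
Running similar checks locally catches most of these problems before upload. The sketch below is an approximation of the rules listed above, not Soramai's validator: `validate_rows` is a hypothetical helper, blank lines are skipped by assumption, and the error wording is invented. Since duplicates are accepted by the platform, the function only counts them so you can decide whether to remove them.

```python
import json
from collections import Counter

def validate_rows(text):
    """Return (errors, duplicate_count) for the JSONL content in `text`."""
    errors = []
    seen = Counter()
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines (e.g. a trailing newline) are skipped
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg} at column {exc.colno})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        prompt, response = row.get("prompt"), row.get("response")
        ok = True
        for field, value in (("prompt", prompt), ("response", response)):
            if not isinstance(value, str) or not value.strip():
                errors.append(f"line {lineno}: missing or empty '{field}'")
                ok = False
        if ok:
            seen[(prompt, response)] += 1
    # Every repeat beyond the first occurrence counts as one duplicate.
    duplicates = sum(n - 1 for n in seen.values() if n > 1)
    return errors, duplicates
```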

Best practices

Things that materially affect fine-tuning quality.

  • Smaller, cleaner datasets beat larger noisy ones. 500 carefully written examples typically outperform 5,000 scraped ones for any single task.
  • Diversity within the task. If you want a customer support bot, include examples covering billing, technical support, complaints, compliments, edge cases. Don’t fine-tune on only one ticket category.
  • Match inference-time format. Your fine-tuning prompts should look like what real users will type. If users send single questions, fine-tune on single questions — not multi-turn conversations.
  • Consistent style in responses. Tone, length, and structure of the response field set the model’s output distribution. Variance here is what the model will replicate.
  • Start small. A 100-step run with 200 rows costs pennies and tells you whether your dataset is in the right shape. Scale up once that’s clean.
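
For the "start small" run, a reproducible random subset of an existing dataset is usually enough. A minimal sketch, assuming the dataset is already loaded as a list of JSONL lines; `sample_rows` and its defaults are illustrative rather than a Soramai feature:

```python
import random

def sample_rows(lines, k=200, seed=0):
    """Pick up to k non-blank JSONL lines, reproducibly for a given seed."""
    lines = [ln for ln in lines if ln.strip()]
    if len(lines) <= k:
        return lines  # dataset is already small enough
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(lines, k)
```

Sampling at random rather than taking the first 200 lines keeps the subset representative when the file is grouped by category.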