Datasets and formats

The exact shape Soramai expects for training data.

Soramai validates every dataset before queueing a job. This page documents accepted formats, validation rules, and how to convert common dataset layouts into a Soramai-ready format.

Text · JSONL

Text training accepts a single .jsonl file. Each line is one training example, encoded as a JSON object.

Prompt / response (simplest)

{"prompt": "Summarize the changelog.", "response": "Soramai now supports..."}
{"prompt": "Triage this support ticket.", "response": "This appears to be..."}

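Chat (multi-turn)

The example is pretty-printed here for readability. In the .jsonl file, each example must occupy a single line.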

{
  "messages": [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "My deploy is stuck on 'building'."},
    {"role": "assistant", "content": "That usually means..."}
  ]
}
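
Since each line of the file is one example, a chat object like the one above has to be serialized compactly before it is written. A minimal sketch using Python's standard json module; the helper name to_jsonl is illustrative and not part of any Soramai tooling.

```python
import json

def to_jsonl(examples):
    """Serialize an iterable of dicts to JSONL text, one object per line."""
    # json.dumps without `indent` never emits raw newlines: line breaks inside
    # strings are escaped as \n, so every example stays on a single line.
    return "".join(json.dumps(ex, ensure_ascii=False) + "\n" for ex in examples)

chat_example = {
    "messages": [
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": "My deploy is stuck on 'building'."},
        {"role": "assistant", "content": "That usually means..."},
    ]
}

jsonl_text = to_jsonl([chat_example])

# To save it, write the text as UTF-8:
# with open("data.jsonl", "w", encoding="utf-8", newline="\n") as f:
#     f.write(jsonl_text)
```

Building the text in memory first keeps the sketch easy to inspect; for large datasets, writing line by line inside the loop would avoid holding everything at once.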

Validation rules

UTF-8. Files in any other encoding are rejected.
Maximum 8,192 tokens per example by default (configurable up to 32,768).
At least 32 examples. Soramai warns when a dataset has fewer than 200.
Fields outside the documented schema are ignored rather than treated as an error.
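
Some of these rules can be checked locally before uploading. The sketch below is not Soramai's validator: it covers only encoding, JSON shape, and example counts, and it skips the token limit because that depends on a tokenizer this page does not specify. The function name and the handling of blank lines are assumptions.

```python
import json

MIN_EXAMPLES = 32    # fewer than this is rejected, per the rules above
WARN_EXAMPLES = 200  # fewer than this triggers a warning, per the rules above

def check_text_dataset(raw: bytes):
    """Return (errors, warnings) for the raw bytes of a .jsonl file."""
    errors, warnings = [], []
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason}"], warnings

    count = 0
    # Split on "\n" only: str.splitlines() would also break on characters
    # such as U+2028, which may legally appear inside JSON strings.
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped rather than rejected
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(example, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        has_pair = "prompt" in example and "response" in example
        has_chat = isinstance(example.get("messages"), list)
        if not (has_pair or has_chat):
            errors.append(f"line {lineno}: expected prompt/response or messages")
            continue
        count += 1  # extra fields are allowed, so they are not inspected

    if count < MIN_EXAMPLES:
        errors.append(f"only {count} valid examples; at least {MIN_EXAMPLES} required")
    elif count < WARN_EXAMPLES:
        warnings.append(f"{count} examples; fewer than {WARN_EXAMPLES} triggers a warning")
    return errors, warnings
```

A typical call would be check_text_dataset(open("data.jsonl", "rb").read()). Reading the file as bytes matters, because the encoding check has to see the data before Python decodes it.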

Image · ZIP

Image training accepts a single .zip file. Each image is paired (by base filename) with an optional .txt caption.

dataset.zip
├── 01.png
├── 01.txt        # caption (optional)
├── 02.png
├── 02.txt
├── 03.jpg
└── 03.txt

Validation rules

PNG, JPEG (.jpg or .jpeg), or WebP. Other formats are rejected.
Minimum dimension 512 px. Images are bucketed to common aspect ratios at training time.
At least 8 images. For style LoRAs, Soramai warns when there are fewer than 15.
Captions are optional. If absent, Soramai can auto-caption with a vision model before training. Auto-captions are editable before the run starts.
To invoke the LoRA with a specific token at inference time, include a unique trigger token in the captions.
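
The file-level rules above can be approximated locally with Python's zipfile module. This is a sketch under stated assumptions, not Soramai's validator: it treats extensions as case-insensitive, treats any non-image, non-.txt file as a rejected format, and does not check the 512 px minimum because that requires decoding the images.

```python
import zipfile
from pathlib import PurePosixPath

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
MIN_IMAGES = 8
WARN_IMAGES = 15  # threshold the rules above give for style LoRAs

def check_image_zip(zip_file):
    """Return (errors, warnings, uncaptioned) for a ZIP path or file object."""
    errors, warnings = [], []
    with zipfile.ZipFile(zip_file) as zf:
        # Directory entries end with "/" and carry no data of their own.
        names = [n for n in zf.namelist() if not n.endswith("/")]

    images, captions = {}, set()
    for name in names:
        path = PurePosixPath(name)
        ext = path.suffix.lower()  # assumption: extensions are case-insensitive
        stem = str(path.with_suffix(""))
        if ext in IMAGE_EXTS:
            images[stem] = name
        elif ext == ".txt":
            captions.add(stem)
        else:
            # assumption: any other file type counts as a rejected format
            errors.append(f"{name}: unsupported file type")

    if len(images) < MIN_IMAGES:
        errors.append(f"only {len(images)} images; at least {MIN_IMAGES} required")
    elif len(images) < WARN_IMAGES:
        warnings.append(f"{len(images)} images; fewer than {WARN_IMAGES} may trigger a warning")

    for stem in sorted(captions - set(images)):
        warnings.append(f"{stem}.txt: caption has no matching image")
    # Images without a caption are listed, not flagged: captions are optional.
    uncaptioned = sorted(images[s] for s in set(images) - captions)
    return errors, warnings, uncaptioned
```

The uncaptioned list is returned separately so it can be reviewed before deciding whether to write captions by hand or rely on auto-captioning.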

Common conversions

If you already have data in a different shape, here is how to bring it over.

Alpaca / instruction-tuning

Concatenate the instruction and input fields into prompt, and map output to response.
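
A minimal sketch of that mapping, assuming the usual Alpaca field names (instruction, input, output). Joining instruction and input with a blank line is an assumption; this page does not prescribe a separator.

```python
def alpaca_to_example(record):
    """Convert one Alpaca-style record to a prompt/response example."""
    instruction = record["instruction"].strip()
    # `input` is often empty or missing in Alpaca-style data.
    extra = (record.get("input") or "").strip()
    # assumption: a blank line separates instruction from input
    prompt = f"{instruction}\n\n{extra}" if extra else instruction
    return {"prompt": prompt, "response": record["output"]}

example = alpaca_to_example(
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}
)
# example == {"prompt": "Translate to French.\n\nHello", "response": "Bonjour"}
```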

ShareGPT / conversations

Map each conversation to the messages schema: rename each turn's from and value keys to role and content, replacing the human role with user and gpt with assistant.
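
A sketch of the role mapping, assuming the common ShareGPT layout in which each record has a conversations list whose turns carry from and value keys. Passing a system turn through unchanged is an assumption, and any other role raises an error so it is not silently dropped.

```python
# human -> user and gpt -> assistant come from the text above;
# keeping system as-is is an assumption.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_example(record):
    """Convert one ShareGPT-style record to the messages schema."""
    messages = []
    for turn in record["conversations"]:
        source = turn["from"]
        if source not in ROLE_MAP:
            raise ValueError(f"unmapped ShareGPT role: {source!r}")
        messages.append({"role": ROLE_MAP[source], "content": turn["value"]})
    return {"messages": messages}

example = sharegpt_to_example(
    {
        "conversations": [
            {"from": "human", "value": "My deploy is stuck on 'building'."},
            {"from": "gpt", "value": "That usually means..."},
        ]
    }
)
```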

CSV exports

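Convert with a one-liner: csvtojson data.csv | jq -c '.[]' > data.jsonl. csvtojson emits a single JSON array, and the .[] filter unpacks it into one object per line. The CSV column headers become the JSON keys, so name them to match the schema (for example, prompt and response).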
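
If Node tooling is not available, the same conversion can be sketched with Python's standard csv module. The column names prompt and response are assumptions about the export; pass different names if the CSV uses other headers.

```python
import csv
import io
import json

def csv_to_jsonl(csv_file, prompt_col="prompt", response_col="response"):
    """Read CSV rows from a text stream and return JSONL text."""
    reader = csv.DictReader(csv_file)  # uses the first row as column names
    lines = []
    for row in reader:
        example = {"prompt": row[prompt_col], "response": row[response_col]}
        lines.append(json.dumps(example, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)

sample = io.StringIO('prompt,response\nHi,Hello\n"multi\nline",ok\n')
jsonl_text = csv_to_jsonl(sample)

# With a real file, open it with newline="" as the csv module recommends:
# with open("data.csv", encoding="utf-8", newline="") as f:
#     jsonl_text = csv_to_jsonl(f)
```

Quoted CSV fields that span several lines still come out as one JSONL line each, because json.dumps escapes the embedded newline.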

Folders of captioned images

Zip the folder directly. Soramai accepts any flat layout where each .txt caption matches its image by base filename.
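
A sketch of building that archive with Python's zipfile module. The helper name zip_flat is illustrative. It stores files at the archive root and skips subfolders and unrelated files, on the assumption that only images and captions belong in the upload.

```python
import zipfile
from pathlib import Path

KEEP = {".png", ".jpg", ".jpeg", ".webp", ".txt"}

def zip_flat(folder, zip_path):
    """Zip the images and captions in `folder` into a flat archive."""
    folder = Path(folder)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.iterdir()):
            # Skip subdirectories and anything that is not an image or caption.
            if path.is_file() and path.suffix.lower() in KEEP:
                # arcname drops the parent directories, keeping the layout flat
                zf.write(path, arcname=path.name)
                added.append(path.name)
    return added

# Example: zip_flat("my_images", "dataset.zip")
```

The function returns the names it added, which makes it easy to confirm that every image and caption made it into the archive.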