Datasets and formats
The exact shape Soramai expects for training data.
Soramai validates every dataset before queueing a job. This page documents accepted formats, validation rules, and how to convert common dataset layouts into a Soramai-ready format.
Text · JSONL
Text training accepts a single .jsonl file. Each line is one training example, encoded as a JSON object.
Prompt / response (simplest)
{"prompt": "Summarize the changelog.", "response": "Soramai now supports..."}
{"prompt": "Triage this support ticket.", "response": "This appears to be..."}
Chat (multi-turn)
{
  "messages": [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "My deploy is stuck on 'building'."},
    {"role": "assistant", "content": "That usually means..."}
  ]
}
The example is expanded here for readability. In the .jsonl file, each example must occupy a single line.
Validation rules
- UTF-8. Files in any other encoding are rejected.
- Maximum 8,192 tokens per example by default (configurable up to 32,768).
- At least 32 examples. Soramai warns under 200.
- Fields outside the documented schema are ignored rather than rejected.
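Most of the rules above can be approximated locally before uploading. The following is a minimal sketch, not Soramai's actual validator: `check_jsonl` and its constants are hypothetical names, blank lines are assumed to be skipped, and the per-example token limit is not checked because the tokenizer is not specified on this page.

```python
import json

MIN_EXAMPLES = 32    # hard minimum from the rules above
WARN_EXAMPLES = 200  # below this, Soramai warns


def check_jsonl(raw: bytes) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for the raw bytes of a .jsonl file."""
    errors: list[str] = []
    warnings: list[str] = []

    # Rule: UTF-8 only.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"], warnings

    valid = 0
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped, not counted
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(example, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        # Accept either documented shape; extra fields are ignored.
        has_pair = "prompt" in example and "response" in example
        has_chat = isinstance(example.get("messages"), list)
        if not (has_pair or has_chat):
            errors.append(f"line {lineno}: expected prompt/response or messages")
            continue
        valid += 1

    if valid < MIN_EXAMPLES:
        errors.append(f"found {valid} examples, need at least {MIN_EXAMPLES}")
    elif valid < WARN_EXAMPLES:
        warnings.append(f"only {valid} examples, fewer than {WARN_EXAMPLES}")
    return errors, warnings
```

Call it as `check_jsonl(open("data.jsonl", "rb").read())`. Reading the file in binary mode matters here, so that the encoding check runs on the original bytes.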
Image · ZIP
Image training accepts a single .zip file. Each image is paired (by base filename) with an optional .txt caption.
dataset.zip
├── 01.png
├── 01.txt   # caption (optional)
├── 02.png
├── 02.txt
├── 03.jpg
└── 03.txt
Validation rules
- PNG, JPG, JPEG, WEBP. Other formats are rejected.
- Minimum dimension 512 px. Images are bucketed to common aspect ratios at training time.
- At least 8 images. Soramai warns under 15 for style LoRAs.
- Captions are optional. If absent, Soramai can auto-caption with a vision model before training. Auto-captions are editable before the run starts.
- Use a unique trigger token in captions if you want a specific token to invoke the LoRA at inference time.
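A quick local check of an archive against the rules above might look like the sketch below. It is illustrative only: `check_image_zip` and the report keys are hypothetical, it uses only the standard library, and it does not verify the 512 px minimum, which would require decoding each image.

```python
import zipfile
from pathlib import PurePosixPath

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
MIN_IMAGES = 8    # hard minimum from the rules above
WARN_IMAGES = 15  # warning threshold for style LoRAs


def check_image_zip(path_or_file) -> dict:
    """Summarize a dataset archive: image count, caption pairing, issues."""
    with zipfile.ZipFile(path_or_file) as zf:
        # Skip directory entries; keep only files.
        names = [n for n in zf.namelist() if not n.endswith("/")]

    images, caption_stems, rejected = [], set(), []
    for name in names:
        p = PurePosixPath(name)
        ext = p.suffix.lower()
        if ext in IMAGE_EXTS:
            images.append(p)
        elif ext == ".txt":
            caption_stems.add(p.with_suffix(""))
        else:
            rejected.append(name)  # unsupported format

    # Images are paired with captions by base filename.
    uncaptioned = [str(p) for p in images if p.with_suffix("") not in caption_stems]

    errors, warnings = [], []
    if len(images) < MIN_IMAGES:
        errors.append(f"found {len(images)} images, need at least {MIN_IMAGES}")
    elif len(images) < WARN_IMAGES:
        warnings.append(f"only {len(images)} images, fewer than {WARN_IMAGES}")

    return {
        "images": len(images),
        "uncaptioned": uncaptioned,
        "rejected": rejected,
        "errors": errors,
        "warnings": warnings,
    }
```

Uncaptioned images are reported rather than treated as errors, since captions are optional.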
Common conversions
If you already have data in a different shape, here is how to bring it over.
Alpaca / instruction-tuning
Concatenate instruction + input into prompt, and keep output as response.
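As a rough sketch of that mapping, assuming the usual Alpaca field names (`instruction`, `input`, `output`): the function name `alpaca_to_soramai` is hypothetical, and the blank-line separator between instruction and input is a choice made here, not something Soramai prescribes.

```python
def alpaca_to_soramai(record: dict) -> dict:
    """Convert one Alpaca-style record to a prompt/response example."""
    instruction = record["instruction"].strip()
    # "input" is often empty or missing in Alpaca-style data.
    extra = (record.get("input") or "").strip()
    # Assumption: join instruction and input with a blank line.
    prompt = f"{instruction}\n\n{extra}" if extra else instruction
    return {"prompt": prompt, "response": record["output"]}
```

Write each converted record with `json.dumps(...)` on its own line to produce the .jsonl file.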
ShareGPT / conversations
Map each conversation to the messages schema. Replace human with user and gpt with assistant.
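The role mapping could be sketched as follows. This assumes the common ShareGPT layout, where each record has a `conversations` list of `{"from": ..., "value": ...}` turns; the function names are hypothetical, and passing `system` turns through unchanged is an assumption rather than documented behavior.

```python
import json

# human -> user and gpt -> assistant, as described above.
# Assumption: "system" turns, if present, keep their role.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(conversation: dict) -> dict:
    """Convert one ShareGPT-style conversation to the messages schema."""
    messages = []
    for turn in conversation["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            raise ValueError(f"unknown ShareGPT role: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}


def to_jsonl_line(example: dict) -> str:
    """Serialize one example on a single line, as JSONL requires."""
    return json.dumps(example, ensure_ascii=False)
```

Unknown roles raise an error instead of being dropped silently, so unexpected speaker labels surface before upload.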
CSV exports
Convert with a one-liner: csvtojson data.csv | jq -c '.[]' > data.jsonl.
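If those command-line tools are not available, the same conversion can be sketched with Python's standard library. This assumes the CSV has a header row whose column names already match the schema (for example `prompt` and `response`); `csv_to_jsonl` is a hypothetical helper name.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Turn CSV text with a header row into JSONL, one object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row becomes a JSON object keyed by the header names.
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in reader)
```

For a file on disk, read it with `open("data.csv", newline="", encoding="utf-8")` and write the result to `data.jsonl` as UTF-8.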
Folders of captioned images
Zip the folder directly. Soramai accepts any flat layout where each image has a matching .txt caption by base filename.
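One way to produce such an archive without a wrapping top-level directory is sketched below. `zip_flat_folder` is a hypothetical helper; it only includes files directly inside the folder and skips subdirectories.

```python
import zipfile
from pathlib import Path


def zip_flat_folder(folder, out_path) -> int:
    """Zip the files directly inside `folder` at the archive root.

    Returns the number of files written.
    """
    folder = Path(folder)
    out_resolved = Path(out_path).resolve()
    # List files up front so the archive never tries to include itself.
    files = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.resolve() != out_resolved
    )
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for item in files:
            # arcname drops the parent path, keeping the layout flat.
            zf.write(item, arcname=item.name)
    return len(files)
```

For example, `zip_flat_folder("my_images", "dataset.zip")` stores `01.png` and `01.txt` at the root of `dataset.zip` rather than under `my_images/`.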