Data Preparation Guide

Refer to Workflow · Step 2 for the primary instructions on organizing data and running ml-split. This guide covers the layout requirements, customization options, and validation utilities that help you keep datasets reproducible.

Required Layout

data/<dataset>/
├── raw/
│   ├── class_a/
│   ├── class_b/
│   └── ...
└── splits/
    ├── test.txt
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    └── ...

raw/ holds the original images (you create this once).
splits/ is generated by ml-split and contains text index files that reference files in raw/. Never edit these files by hand.
model.num_classes must equal the number of class directories inside raw/.
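The class-count requirement above can be checked before training. The following is a minimal sketch, assuming you pass the expected count by hand; how model.num_classes is stored in your config is not covered here, and `list_class_dirs` and `check_num_classes` are illustrative names rather than part of the toolkit.

```python
from pathlib import Path


def list_class_dirs(raw_dir):
    """Return the sorted names of the directories directly under raw/."""
    return sorted(p.name for p in Path(raw_dir).iterdir() if p.is_dir())


def check_num_classes(raw_dir, expected):
    """Raise ValueError if `expected` (your model.num_classes) disagrees with raw/."""
    classes = list_class_dirs(raw_dir)
    if len(classes) != expected:
        raise ValueError(
            f"model.num_classes={expected}, but raw/ has "
            f"{len(classes)} class folders: {classes}"
        )
    return classes
```

For example, `check_num_classes('data/my_dataset/raw', expected=3)` raises unless raw/ contains exactly three class folders.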

Core Commands (Recap)

Create or verify the directory structure, then run:

ml-split --raw_data data/my_dataset/raw --folds 5

The test set in test.txt is shared across all folds; only train/val indexes change.
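Because the test set is shared, a useful sanity check is that no test entry also appears in any fold's train or val index. The sketch below assumes each index file lists one image path per line and that the test set is meant to be disjoint from every fold; neither the file format nor the helper names (`read_index`, `find_test_leaks`) come from ml-split itself.

```python
from pathlib import Path


def read_index(path):
    """Read an index file into a set of entries (assumes one path per line)."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def find_test_leaks(splits_dir):
    """Map each fold file name to the entries it shares with test.txt."""
    splits_dir = Path(splits_dir)
    test_entries = read_index(splits_dir / "test.txt")
    leaks = {}
    for fold_file in sorted(splits_dir.glob("fold_*_*.txt")):
        shared = read_index(fold_file) & test_entries
        if shared:
            leaks[fold_file.name] = shared
    return leaks
```

An empty dictionary from `find_test_leaks('data/my_dataset/splits')` means no fold file shares an entry with test.txt.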

See the workflow page for a complete end-to-end walkthrough including training afterwards.

Customizing `ml-split`

Flag	Purpose	Notes
`--folds`	Number of CV folds (default 5)	More folds mean more training runs
`--ratio train val test`	Override 0.7/0.15/0.15 split	Values must sum to 1.0
`--seed`	Random seed	Keep constant for reproducibility
`--output`	Destination for index files	Defaults to `<raw>/../splits`

Example with custom ratios and seed:

ml-split --raw_data data/my_dataset/raw --folds 3 --ratio 0.8 0.1 0.1 --seed 123
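The requirement that `--ratio` values sum to 1.0 can be pre-checked in a wrapper script before calling ml-split. This is a sketch of such a check, not a description of how ml-split validates its input; `validate_ratios` is a hypothetical helper. It compares with a tolerance because floating-point sums can be off by a tiny rounding error.

```python
import math


def validate_ratios(train, val, test):
    """Return the ratios as a tuple, or raise ValueError if they are invalid."""
    ratios = (train, val, test)
    if any(r < 0 for r in ratios):
        raise ValueError(f"ratios must be non-negative, got {ratios}")
    # Use a tolerance instead of == so rounding error does not cause a false alarm.
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        raise ValueError(f"ratios must sum to 1.0, got {sum(ratios)}")
    return ratios
```

Calling `validate_ratios(0.8, 0.1, 0.1)` returns the tuple unchanged, while `validate_ratios(0.5, 0.3, 0.3)` raises ValueError.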

Validation Utilities

Quick shell check

tree -L 2 data/my_dataset
wc -l data/my_dataset/splits/fold_0_*.txt data/my_dataset/splits/test.txt

Verification script

python - <<'PY'
from pathlib import Path

root = Path('data/my_dataset')
raw = root / 'raw'
splits = root / 'splits'

if not raw.exists():
    raise SystemExit('raw/ directory missing')

classes = sorted(p.name for p in raw.iterdir() if p.is_dir())
if not classes:
    raise SystemExit('No class folders found in raw/')

print(f'Classes ({len(classes)}): {classes}')

num_folds = 5  # must match the --folds value passed to ml-split
required = ['test.txt'] + [
    f'fold_{i}_{phase}.txt' for i in range(num_folds) for phase in ('train', 'val')
]
missing = [f for f in required if not (splits / f).exists()]
if missing:
    raise SystemExit(f'Split files missing: {missing}')

print('All expected split files present.')
PY

If you generated a different number of folds, set num_folds in the script to match.

Troubleshooting

Index file missing → Rerun ml-split; ensure the --raw_data path points to the directory that contains class folders.
Image path errors → Regenerate splits after moving images so references stay in sync.
Class count mismatch → Update model.num_classes in your config to match the number of folders in raw/.
Empty split files → Confirm each class contains images and that the ratios you chose leave enough samples for every fold.
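Two of the symptoms above, image path errors and empty split files, can be detected in one pass over splits/. The sketch below assumes each non-blank line of an index file is a bare file path that resolves against a base directory you supply; the actual index format and the directory the paths are relative to are not specified in this guide, so adjust `base_dir` (or the parsing) to match your files. `audit_splits` is an illustrative name.

```python
from pathlib import Path


def audit_splits(splits_dir, base_dir):
    """Return (empty_files, missing) for every .txt index in splits_dir.

    Assumes each non-blank line is a file path, resolved against base_dir.
    """
    empty_files = []
    missing = {}
    for index_file in sorted(Path(splits_dir).glob("*.txt")):
        lines = index_file.read_text().splitlines()
        entries = [ln.strip() for ln in lines if ln.strip()]
        if not entries:
            empty_files.append(index_file.name)
            continue
        # Collect entries whose target file no longer exists.
        gone = [e for e in entries if not (Path(base_dir) / e).is_file()]
        if gone:
            missing[index_file.name] = gone
    return empty_files, missing
```

If `missing` is non-empty after you have moved or renamed images, regenerate the splits rather than patching the index files by hand.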

Best Practices

Treat raw/ as read-only after generating splits; regenerate if you modify contents.
Commit the splits/ directory to version control to preserve experiment reproducibility.
Record the exact ml-split command (including ratios/seed) alongside experiments.
Use dataset analysis utilities (see Workflow Step 4 optional tasks) if you need class balance reports before training.
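If the dataset analysis utilities are not at hand, a rough class balance report can be produced directly from raw/. This is a stand-in sketch, not the project's own analysis tool; the set of image suffixes is an assumption and should be adjusted to the formats you actually store.

```python
from pathlib import Path

# Assumption: these suffixes cover the image formats in raw/.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def class_counts(raw_dir):
    """Return {class_name: image_count} for each class directory in raw/."""
    counts = {}
    for class_dir in sorted(Path(raw_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

A class whose count is zero or far below the others is worth fixing before running ml-split, since small classes are the likeliest cause of empty split files.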

Once your dataset satisfies these checks, continue with Workflow Step 3 to generate the configuration.