Data Preparation Guide

Refer to Workflow · Step 2 for the primary instructions on organising data and running ml-split. This guide focuses on the structure requirements, customization knobs, and validation utilities that help you keep datasets reproducible.


Required Layout

data/<dataset>/
├── raw/
│   ├── class_a/
│   ├── class_b/
│   └── ...
└── splits/
    ├── test.txt
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    └── ...
  • raw/ holds the original images (you create this once).
  • splits/ is generated by ml-split and contains text indexes referencing files in raw/ (never hand-edit).
  • model.num_classes must equal the number of class directories inside raw/.
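
The last requirement can be checked mechanically. The following is a minimal sketch, not part of ml-split: it lists the class directories under raw/ and compares their count with a value you pass in by hand. The helper names list_class_dirs and check_num_classes are hypothetical, and reading model.num_classes out of your config is left to you.

```python
from pathlib import Path


def list_class_dirs(raw_dir):
    """Return the sorted names of the immediate subdirectories of raw_dir."""
    return sorted(p.name for p in Path(raw_dir).iterdir() if p.is_dir())


def check_num_classes(raw_dir, expected):
    """Raise ValueError unless raw_dir holds exactly `expected` class folders."""
    classes = list_class_dirs(raw_dir)
    if len(classes) != expected:
        raise ValueError(
            f"model.num_classes is {expected}, but raw/ contains "
            f"{len(classes)} class folders: {classes}"
        )
    return classes
```

For example, check_num_classes('data/my_dataset/raw', 2) returns the class names when raw/ holds exactly two folders and raises otherwise. Note that every subdirectory is counted, including hidden ones.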

Core Commands (Recap)

  • Create or verify the directory structure, then run:
    ml-split --raw_data data/my_dataset/raw --folds 5
    
  • The test set in test.txt is shared across all folds; only train/val indexes change.
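
Because test.txt is shared, no test entry should also appear in any fold's train or val index, and a fold's train and val indexes should not overlap each other. A minimal sketch of that check follows. It assumes each index file lists one entry per line, which this guide does not specify, and the function names read_index and find_leaks are hypothetical.

```python
from pathlib import Path


def read_index(path):
    """Return the non-empty, stripped lines of an index file as a set."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def find_leaks(splits_dir, num_folds):
    """Map each overlapping pair of index files to the entries they share."""
    splits_dir = Path(splits_dir)
    test = read_index(splits_dir / "test.txt")
    leaks = {}
    for i in range(num_folds):
        train_name = f"fold_{i}_train.txt"
        val_name = f"fold_{i}_val.txt"
        train = read_index(splits_dir / train_name)
        val = read_index(splits_dir / val_name)
        pairs = {
            (train_name, "test.txt"): train & test,
            (val_name, "test.txt"): val & test,
            (train_name, val_name): train & val,
        }
        # Keep only the pairs that actually share entries.
        leaks.update({pair: shared for pair, shared in pairs.items() if shared})
    return leaks
```

An empty dictionary from find_leaks('data/my_dataset/splits', 5) means no overlaps were found under the one-entry-per-line assumption.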

See the workflow page for a complete end-to-end walkthrough including training afterwards.


Customizing ml-split

Flag                    | Purpose                        | Notes
--folds                 | Number of CV folds (default 5) | Higher folds = more training runs
--ratio train val test  | Override 0.7/0.15/0.15 split   | Values must sum to 1.0
--seed                  | Random seed                    | Keep constant for reproducibility
--output                | Destination for index files    | Defaults to <raw>/../splits

Example with custom ratios and seed:

ml-split --raw_data data/my_dataset/raw --folds 3 --ratio 0.8 0.1 0.1 --seed 123
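
The --ratio values must sum to 1.0. How ml-split itself enforces this is not described here, so the following is only a sketch of a pre-check you could run on a ratio triple before invoking the command. It compares with a small tolerance because decimal fractions are not always exact in floating point. The function name validate_ratios is hypothetical.

```python
import math


def validate_ratios(train, val, test):
    """Return the ratios as a tuple, or raise ValueError if they are invalid."""
    ratios = (train, val, test)
    if any(r < 0 or r > 1 for r in ratios):
        raise ValueError(f"Each ratio must lie between 0 and 1: {ratios}")
    # Tolerant comparison: sums of decimal fractions can be off by a few ULPs.
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        raise ValueError(f"Ratios must sum to 1.0, got {sum(ratios)}: {ratios}")
    return ratios
```

With this sketch, validate_ratios(0.8, 0.1, 0.1) passes, while validate_ratios(0.8, 0.1, 0.2) raises a ValueError.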


Validation Utilities

Quick shell check

tree -L 2 data/my_dataset
wc -l data/my_dataset/splits/fold_0_*.txt data/my_dataset/splits/test.txt
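
On systems where tree or wc is not available, a rough Python equivalent of the line count can be sketched as follows. Unlike wc -l, which counts newline characters, this hypothetical count_entries helper counts non-empty lines, so a file with a count of 0 is effectively empty.

```python
from pathlib import Path


def count_entries(splits_dir):
    """Return {file name: number of non-empty lines} for each .txt file."""
    counts = {}
    for path in sorted(Path(splits_dir).glob("*.txt")):
        with path.open() as handle:
            counts[path.name] = sum(1 for line in handle if line.strip())
    return counts
```

Calling count_entries('data/my_dataset/splits') returns one entry per index file, which makes empty or unexpectedly small splits easy to spot.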

Verification script

python - <<'PY'
from pathlib import Path

root = Path('data/my_dataset')
raw = root / 'raw'
splits = root / 'splits'

if not raw.exists():
    raise SystemExit('raw/ directory missing')

classes = sorted(p.name for p in raw.iterdir() if p.is_dir())
if not classes:
    raise SystemExit('No class folders found in raw/')

print(f'Classes ({len(classes)}): {classes}')

num_folds = 5  # must match the --folds value passed to ml-split
required = ['test.txt'] + [
    f'fold_{i}_{phase}.txt' for i in range(num_folds) for phase in ('train', 'val')
]
missing = [f for f in required if not (splits / f).exists()]
if missing:
    raise SystemExit(f'Split files missing: {missing}')

print('All expected split files present.')
PY

If you passed a --folds value other than 5 to ml-split, change the fold count in the script to match.


Troubleshooting

  • Index file missing → Rerun ml-split; ensure the --raw_data path points to the directory that contains class folders.
  • Image path errors → Regenerate splits after moving images so references stay in sync.
  • Class count mismatch → Update model.num_classes in your config to match the number of folders in raw/.
  • Empty split files → Confirm each class contains images and that the ratios you chose leave enough samples for every fold.
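
Image path errors usually mean an index file points at a file that has since moved. The sketch below lists such stale entries for one index file. It rests on an assumption this guide does not confirm: that each non-empty line is a file path relative to the base directory you pass in (for instance raw/). If your index lines carry extra fields, adapt the parsing first. The function name find_missing_files is hypothetical.

```python
from pathlib import Path


def find_missing_files(index_file, base_dir):
    """Return index entries that do not resolve to a file under base_dir.

    Assumption: each non-empty line is a path relative to base_dir.
    Absolute paths are checked as-is, because joining a base directory
    with an absolute path yields the absolute path.
    """
    base_dir = Path(base_dir)
    missing = []
    for line in Path(index_file).read_text().splitlines():
        entry = line.strip()
        if entry and not (base_dir / entry).is_file():
            missing.append(entry)
    return missing
```

A non-empty result from find_missing_files('data/my_dataset/splits/test.txt', 'data/my_dataset/raw') suggests the splits should be regenerated.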

Best Practices

  • Treat raw/ as read-only after generating splits; regenerate if you modify contents.
  • Commit the splits/ directory to version control to preserve experiment reproducibility.
  • Record the exact ml-split command (including ratios/seed) alongside experiments.
  • Use dataset analysis utilities (see Workflow Step 4 optional tasks) if you need class balance reports before training.
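
One way to record the ml-split command alongside an experiment is to store it together with a checksum of each generated index file, so that later changes to splits/ can be detected. This is a sketch under stated assumptions, not a feature of ml-split: the function names and the JSON layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def fingerprint_splits(splits_dir):
    """Return {file name: SHA-256 hex digest} for each .txt file in splits_dir."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(splits_dir).glob("*.txt"))
    }


def write_split_record(splits_dir, command, out_file):
    """Save the ml-split command and the split fingerprints as JSON."""
    record = {"command": command, "files": fingerprint_splits(splits_dir)}
    Path(out_file).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

Comparing a fresh fingerprint_splits result against the stored "files" mapping shows whether any index file has changed since the experiment was run.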

Once your dataset satisfies these checks, continue with Workflow Step 3 to generate the configuration.