Data Preparation Guide
Refer to Workflow · Step 2 for the primary instructions on organising data and running ml-split. This guide focuses on the structure requirements, customization knobs, and validation utilities that help you keep datasets reproducible.
Required Layout
data/<dataset>/
├── raw/
│   ├── class_a/
│   ├── class_b/
│   └── ...
└── splits/
    ├── test.txt
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    └── ...
- raw/ holds the original images (you create this once).
- splits/ is generated by ml-split and contains text indexes referencing files in raw/ (never hand-edit).
- model.num_classes must equal the number of class directories inside raw/.
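The class-count rule above can be checked before training. The following is a minimal sketch, not part of the project's tooling: the helper names `count_class_dirs` and `check_num_classes` are hypothetical, and the configured value is passed in directly because the config format is not described here.

```python
from pathlib import Path


def count_class_dirs(raw_dir: Path) -> int:
    """Count the immediate subdirectories of raw/ (one per class)."""
    return sum(1 for p in raw_dir.iterdir() if p.is_dir())


def check_num_classes(raw_dir: Path, num_classes: int) -> None:
    """Raise ValueError when the configured class count disagrees with raw/."""
    found = count_class_dirs(raw_dir)
    if found != num_classes:
        raise ValueError(
            f"model.num_classes is {num_classes}, "
            f"but {raw_dir} contains {found} class folders"
        )
```

Call `check_num_classes(Path('data/my_dataset/raw'), num_classes)` with the value from your config; stray files placed directly in raw/ are ignored because only directories are counted.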
Core Commands (Recap)
- Create or verify the directory structure, then run ml-split as described in Workflow · Step 2.
- The test set in test.txt is shared across all folds; only the train/val indexes change.
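Because the test set is shared, it is worth confirming that no test entry also appears in a fold's train or val index. The sketch below assumes two things that are not stated in this guide: that each index file lists one entry per line, and that the test set is meant to be fully held out. The function names are hypothetical.

```python
from pathlib import Path


def read_index(path: Path) -> set[str]:
    """Return the non-empty lines of an index file as a set."""
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def find_test_leaks(splits_dir: Path) -> dict[str, set[str]]:
    """Map each fold index file to the entries it shares with test.txt."""
    test_entries = read_index(splits_dir / "test.txt")
    leaks = {}
    for fold_file in sorted(splits_dir.glob("fold_*_*.txt")):
        overlap = read_index(fold_file) & test_entries
        if overlap:
            leaks[fold_file.name] = overlap
    return leaks
```

An empty result means no fold index shares an entry with test.txt.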
See the workflow page for a complete end-to-end walkthrough including training afterwards.
Customizing ml-split
| Flag | Purpose | Notes |
|---|---|---|
| --folds | Number of CV folds (default 5) | Higher folds = more training runs |
| --ratio train val test | Override 0.7/0.15/0.15 split | Values must sum to 1.0 |
| --seed | Random seed | Keep constant for reproducibility |
| --output | Destination for index files | Defaults to <raw>/../splits |
Example with custom ratios and seed:
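A minimal Python sketch of assembling such a command and validating the ratios first. It assumes the argument syntax implied by the table (three space-separated values after --ratio) and the --raw_data flag mentioned under Troubleshooting; the ratio values, seed, and the helper name `build_split_command` are illustrative only. The command is printed rather than executed, so you can compare it against the ml-split help output before running it.

```python
import math
import shlex


def build_split_command(raw_data, ratio, seed, folds=5):
    """Assemble an ml-split argument list after validating the ratios."""
    if len(ratio) != 3:
        raise ValueError("ratio needs three values: train, val, test")
    # The flag table requires the three values to sum to 1.0.
    if not math.isclose(sum(ratio), 1.0, abs_tol=1e-9):
        raise ValueError(f"ratios must sum to 1.0, got {sum(ratio)}")
    return [
        "ml-split",
        "--raw_data", str(raw_data),
        "--folds", str(folds),
        "--ratio", *(str(r) for r in ratio),
        "--seed", str(seed),
    ]


# Illustrative values; substitute your own dataset path, ratios, and seed.
cmd = build_split_command("data/my_dataset/raw", ratio=(0.8, 0.1, 0.1), seed=42)
print(shlex.join(cmd))
```

Building the argument list in one place also makes it easy to log the exact command next to each experiment.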
Validation Utilities
Quick shell check
Verification script
python - <<'PY'
from pathlib import Path

root = Path('data/my_dataset')
raw = root / 'raw'
splits = root / 'splits'

# raw/ must exist and contain at least one class folder.
if not raw.exists():
    raise SystemExit('raw/ directory missing')

classes = sorted(p.name for p in raw.iterdir() if p.is_dir())
if not classes:
    raise SystemExit('No class folders found in raw/')
print(f'Classes ({len(classes)}): {classes}')

# Expect the shared test index plus train/val indexes for each of the 5 default folds.
required = ['test.txt'] + [f'fold_{i}_{phase}.txt' for i in range(5) for phase in ('train', 'val')]
missing = [f for f in required if not (splits / f).exists()]
if missing:
    raise SystemExit(f'Split files missing: {missing}')
print('All expected split files present.')
PY
If you generated a number of folds other than the default 5, change range(5) in the required list to match.
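To avoid hard-coding the fold count, the check can infer it from the files that are present. This is a sketch under the assumption that folds are numbered contiguously from 0, as the fold_0_* names in the layout suggest; `missing_split_files` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches names such as fold_0_train.txt or fold_3_val.txt.
FOLD_FILE = re.compile(r"fold_(\d+)_(train|val)\.txt")


def missing_split_files(splits_dir: Path) -> list[str]:
    """List expected index files that are absent, inferring the fold count."""
    present = {p.name for p in splits_dir.iterdir() if p.is_file()}
    fold_ids = {int(m.group(1)) for name in present if (m := FOLD_FILE.fullmatch(name))}
    # The highest fold number seen determines how many folds to expect.
    n_folds = max(fold_ids) + 1 if fold_ids else 0
    expected = ["test.txt"] + [
        f"fold_{i}_{phase}.txt" for i in range(n_folds) for phase in ("train", "val")
    ]
    return [name for name in expected if name not in present]
```

This catches a fold with only one of its two index files, or a gap in the numbering, but it cannot detect that the highest-numbered folds are missing entirely; compare the inferred count against the --folds value you used.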
Troubleshooting
- Index file missing → Rerun ml-split; ensure the --raw_data path points to the directory that contains class folders.
- Image path errors → Regenerate splits after moving images so references stay in sync.
- Class count mismatch → Update model.num_classes in your config to match the number of folders in raw/.
- Empty split files → Confirm each class contains images and that the ratios you chose leave enough samples for every fold.
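Two of the cases above, image path errors and empty split files, can be detected with a short audit. The sketch assumes each index line is a file path relative to raw/ (for example class_a/img_001.jpg); the index format is not specified in this guide, so adapt the parsing if your lines also carry labels or other fields. Both function names are hypothetical.

```python
from pathlib import Path


def dangling_entries(index_file: Path, raw_dir: Path) -> list[str]:
    """Return index entries whose target file does not exist under raw_dir."""
    entries = [ln.strip() for ln in index_file.read_text().splitlines() if ln.strip()]
    return [e for e in entries if not (raw_dir / e).is_file()]


def audit_splits(splits_dir: Path, raw_dir: Path) -> dict[str, str]:
    """Report empty index files and files with unresolved references."""
    problems = {}
    for index_file in sorted(splits_dir.glob("*.txt")):
        if not index_file.read_text().strip():
            problems[index_file.name] = "empty"
        elif dangling_entries(index_file, raw_dir):
            problems[index_file.name] = "dangling references"
    return problems
```

An empty dictionary means every index file has at least one entry and every entry resolves to a file under raw/.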
Best Practices
- Treat raw/ as read-only after generating splits; regenerate if you modify contents.
- Commit the splits/ directory to version control to preserve experiment reproducibility.
- Record the exact ml-split command (including ratios/seed) alongside experiments.
- Use dataset analysis utilities (see Workflow Step 4 optional tasks) if you need class balance reports before training.
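One way to record the command alongside an experiment is to store it with a digest of each index file, so a later run can confirm it used the same splits. This is a sketch only: `write_split_record` and the output file name are hypothetical and not part of the project's tooling.

```python
import hashlib
import json
from pathlib import Path


def write_split_record(splits_dir: Path, command: str, out_file: Path) -> dict:
    """Save the command used and a SHA-256 digest of every index file."""
    record = {
        "command": command,
        "files": {
            p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(splits_dir.glob("*.txt"))
        },
    }
    out_file.write_text(json.dumps(record, indent=2))
    return record
```

For example, `write_split_record(Path('data/my_dataset/splits'), command, Path('split_record.json'))` writes the record next to your experiment outputs; if any digest differs on a later run, the splits were regenerated or edited.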
Once your dataset satisfies these checks, continue with Workflow Step 3 to generate the configuration.