Unified Workflow Guide
Single end-to-end workflow for preparing data, training across folds, tuning, evaluating, and exporting PyTorch image classifiers built with this framework.
Overview & Prerequisites
- Python 3.8+, PyTorch 2.0+, optional CUDA-enabled GPU
- uv installed for dependency management
- Dataset organized as data/<name>/raw/<class_name>/*.jpg
- Sufficient disk space under runs/ for fold artifacts
Step 1: Install Dependencies
Options:
- uv pip install -e ".[dev]" – development utilities (pytest, ruff, mkdocs)
- uv pip install -e ".[optuna]" – Optuna hyperparameter search support
- uv pip install -e ".[dp]" – differential privacy training
- Combine extras as needed, e.g. ".[dev,optuna]"
Step 2: Prepare the Dataset
- Ensure images are stored under data/<name>/raw/<class>/image.jpg.
- Generate cross-validation splits (required for later steps):
Outputs index files in data/my_dataset/splits/ including a shared test.txt.
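To illustrate what such splits contain, here is a minimal sketch of a stratified split with a shared held-out test set. It is not the framework's own split command; the function name make_splits, the default fold count, and the test fraction are all assumptions chosen for the example.

```python
import random
from collections import defaultdict


def make_splits(samples, n_folds=5, test_fraction=0.2, seed=0):
    """Split (path, class_name) pairs into a shared test set and k folds.

    Hypothetical helper for illustration only. Returns (test, folds), where
    folds[k] is a dict with "train" and "val" lists of paths.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    test = []
    fold_bins = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        paths = sorted(by_class[label])
        rng.shuffle(paths)
        # Hold out the same fraction of every class for the shared test set.
        n_test = int(len(paths) * test_fraction)
        test.extend(paths[:n_test])
        # Deal the remaining images round-robin so each fold sees every class.
        for i, path in enumerate(paths[n_test:]):
            fold_bins[i % n_folds].append(path)

    folds = []
    for k in range(n_folds):
        val = list(fold_bins[k])
        train = [p for j, b in enumerate(fold_bins) if j != k for p in b]
        folds.append({"train": train, "val": val})
    return test, folds
```

Each fold's train and val lists could then be written one path per line to files named like fold_{k}_train.txt and fold_{k}_val.txt, with the held-out list going to test.txt.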
Optional checks:
- Run dataset analysis utilities (e.g., ml_src.core.data.analyze_dataset) if you want class balance or statistics before training.
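If you only need a quick look at class balance, a small standalone sketch like the following may be enough. It assumes the data/<name>/raw/<class_name>/*.jpg layout described above; the helper names are hypothetical and this is not how analyze_dataset itself is implemented.

```python
from collections import Counter
from pathlib import Path


def count_images_per_class(raw_dir):
    """Count *.jpg files in each class subdirectory of raw_dir."""
    raw_dir = Path(raw_dir)
    counts = Counter()
    for class_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.jpg"))
    return counts


def imbalance_ratio(counts):
    """Largest class size divided by the smallest (inf if a class is empty)."""
    sizes = list(counts.values())
    if not sizes or min(sizes) == 0:
        return float("inf")
    return max(sizes) / min(sizes)
```

For example, count_images_per_class("data/my_dataset/raw") returns a per-class count, and a large imbalance_ratio suggests reviewing the dataset before tuning.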
Step 3: Generate & Customize Configuration
- Create a baseline config using detected dataset settings:
- For built-in Optuna search space, add --optuna.
- Edit configs/my_dataset_config.yaml to confirm:
  - data.dataset_name, data.data_dir, data.fold
  - model.num_classes, architecture, pretrained weights
  - training.trainer_type (default standard)
  - Optional features such as EMA (training.ema) or callbacks
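The checklist above can be turned into a quick sanity check on an already-loaded config. The sketch below works on a plain nested dict; the helper names are hypothetical, and treating the four listed keys as strictly required is an assumption. The accepted trainer types are the ones this guide names in Step 6.

```python
# Keys this guide asks you to confirm; treating them as required is an assumption.
REQUIRED_KEYS = [
    "data.dataset_name",
    "data.data_dir",
    "data.fold",
    "model.num_classes",
]
# Trainer types named in this guide.
TRAINER_TYPES = {"standard", "mixed_precision", "accelerate", "dp"}


def get_nested(config, dotted_key):
    """Return config["a"]["b"] for "a.b", or None if any part is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def check_config(config):
    """Return a list of human-readable problems (empty if none found)."""
    problems = [f"missing: {key}" for key in REQUIRED_KEYS
                if get_nested(config, key) is None]
    # trainer_type has a default, so only validate it when it is present.
    trainer = get_nested(config, "training.trainer_type")
    if trainer is not None and trainer not in TRAINER_TYPES:
        problems.append(f"unknown trainer_type: {trainer}")
    return problems
```

An empty list from check_config means the fields listed above are present; it does not validate architecture names or pretrained weights.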
Step 4: Optional Pre-Training Utilities
- Learning rate finder:
Adjust ranges with --start_lr, --end_lr, --num_iter, or change divergence sensitivity via --diverge_threshold.
- Dataset reports: Use analysis helpers to generate statistics or plots before tuning (recommended for imbalanced datasets).
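For intuition about what the learning rate finder options control, here is a minimal sketch of a generic learning rate range test. It is not the framework's implementation: the default values are placeholders, loss_at_lr is a hypothetical callable standing in for one training step at a given learning rate, and interpreting diverge_threshold as a multiple of the best loss seen so far is an assumption.

```python
import math


def lr_range_test(loss_at_lr, start_lr=1e-7, end_lr=10.0, num_iter=100,
                  diverge_threshold=4.0):
    """Sweep the learning rate exponentially and record (lr, loss) pairs.

    Stops early once the loss exceeds diverge_threshold times the best
    loss observed so far (assumed meaning of the threshold).
    """
    if num_iter < 2:
        raise ValueError("num_iter must be at least 2")
    # Constant multiplicative step so lr goes from start_lr to end_lr.
    ratio = (end_lr / start_lr) ** (1.0 / (num_iter - 1))
    history = []
    best_loss = math.inf
    lr = start_lr
    for _ in range(num_iter):
        loss = loss_at_lr(lr)
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > diverge_threshold * best_loss:
            break
        lr *= ratio
    return history


def toy_loss(lr):
    """Synthetic loss curve with its minimum near lr = 1e-2 (illustration only)."""
    return (math.log10(lr) + 2.0) ** 2 + 0.1
```

Calling lr_range_test(toy_loss) sweeps upward from start_lr and stops shortly after the synthetic loss starts climbing. A lower diverge_threshold stops the sweep sooner, and a higher one tolerates more growth.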
Step 5: Hyperparameter Search (Optional, before full CV)
- Ensure Optuna extras are installed if running search.
- Run search on a representative fold (commonly fold 0):
- Inspect results via ml-visualise --mode search --study-name <study>.
- Reuse the exported runs/optuna_studies/<study>/best_config.yaml for subsequent fold training.
Tip: For quick experiments, you can skip this step and rely on defaults, but tuned hyperparameters usually transfer better when training every fold.
Step 6: Train Across Folds
Cross-validation is the standard workflow. Train each fold with the finalized configuration (from Step 3 or best config from Step 5).
Each run writes to runs/<dataset>_*_fold_<n>/ with saved config, logs, checkpoints, and TensorBoard events.
Options:
- Quick iteration: Run a single fold (e.g., --fold 0) to validate the pipeline fast.
- CLI overrides: --batch_size, --lr, --num_epochs, --num_workers, etc.
- Resume: ml-train --config ... --resume runs/<run_name>/weights/last.pt
- Trainer types: set training.trainer_type to standard, mixed_precision, accelerate, or dp before training.
- EMA: enable in config via training.ema.enabled: true for smoothing improvements.
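When training every fold, it can help to generate the per-fold commands programmatically. The sketch below only builds argument lists from the --config and --fold flags mentioned in this guide. The helper name is hypothetical, and combining the flags this way is an assumption you should check against ml-train --help.

```python
import shlex


def fold_commands(config_path, n_folds, extra_args=()):
    """Build one ml-train argument list per fold (nothing is executed)."""
    commands = []
    for fold in range(n_folds):
        cmd = ["ml-train", "--config", config_path, "--fold", str(fold)]
        cmd.extend(extra_args)  # e.g. ["--num_epochs", "2"] for a smoke test
        commands.append(cmd)
    return commands


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

Each list can be printed with as_shell for review, or passed to subprocess.run(cmd, check=True) to run the folds sequentially.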
Step 7: Monitor & Review Progress
- Launch TensorBoard per run or root directory:
- Inspect logs: cat runs/<run>/logs/train.log or tail for live updates.
- GPU monitoring (if applicable): watch -n 1 nvidia-smi
- Clean up logs when finished: ml-visualise --mode clean --run_dir runs/<run>
Step 8: Evaluate & Run Inference
- Use the best checkpoint from each fold to assess validation/test metrics:
- Evaluate alternate splits or configs with --split or --config overrides.
- Options for improved accuracy:
  - TTA: add --tta and optional --tta-augmentations ...
  - Ensemble: ml-inference --ensemble runs/my_dataset_fold_0/weights/best.pt runs/my_dataset_fold_1/weights/best.pt ...
  - TTA + ensemble: combine flags for maximum robustness (slowest path)
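A common way to combine predictions from several fold models or several augmented views is to average their class probabilities and take the most likely class. The sketch below shows that idea for a single image. Whether ml-inference combines predictions this way is an assumption, and the function names are hypothetical.

```python
def average_probabilities(prob_sets):
    """Average per-class probabilities across models or augmented views.

    prob_sets is a list of equal-length probability vectors for one image.
    """
    if not prob_sets:
        raise ValueError("need at least one probability vector")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all probability vectors must have the same length")
    return [sum(p[c] for p in prob_sets) / len(prob_sets)
            for c in range(n_classes)]


def predict(prob_sets):
    """Return (predicted_class_index, averaged_probabilities)."""
    avg = average_probabilities(prob_sets)
    best = max(range(len(avg)), key=avg.__getitem__)
    return best, avg
```

With TTA plus an ensemble, prob_sets would hold one vector per (model, augmentation) pair, which is why that path is the slowest.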
Step 9: Export & Deployment Preparation
- Export to ONNX from the preferred checkpoint (single fold or ensemble representative):
- Add validation or benchmarking options as needed:
  - --validate (comprehensive)
  - --validate-basic
  - --benchmark
  - Custom --output, --input_size, --opset
Step 10: Follow-up & Maintenance
- Aggregate metrics across folds (e.g., averaging validation/test scores stored in each run directory).
- Archive or prune large runs/ entries when finished.
- Iterate on configuration, callbacks, or data augmentations based on insights from monitoring and inference.
- For fast smoke tests, temporarily reduce epochs (--num_epochs 2) or batch size before returning to full cross-validation runs.
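Aggregating across folds usually means reporting the mean and spread of one metric. The sketch below assumes you have already collected one score per fold into a list. How scores are stored in each run directory is not covered here, and the helper name is hypothetical.

```python
from statistics import mean, stdev


def summarize_folds(scores):
    """Return mean, sample standard deviation, and fold count for a metric."""
    if not scores:
        raise ValueError("need at least one fold score")
    return {
        "mean": mean(scores),
        # Sample standard deviation is undefined for a single fold.
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "n_folds": len(scores),
    }
```

For instance, summarize_folds([0.90, 0.92, 0.88]) uses placeholder values, not real results, and reports a mean near 0.90 with a standard deviation near 0.02.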
Reference Outputs per Step
- Step 2: data/<name>/splits/fold_{k}_{train,val}.txt, test.txt
- Step 3: configs/<name>_config.yaml
- Step 4: runs/lr_finder_<timestamp>/{lr_plot.png,results.json,logs/}
- Step 5 (if used): runs/optuna_studies/<study>/{best_config.yaml,trial_*/}
- Step 6: runs/<dataset>_*_fold_<n>/{config.yaml,summary.txt,weights/,logs/,tensorboard/}
- Step 9: runs/<run>/weights/best.onnx plus optional validation reports
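The Step 6 entry above can double as a checklist after a training run. The following sketch reports which of those artifacts are absent from a run directory. The helper name is hypothetical, and it only checks that the paths exist, not that their contents are valid.

```python
from pathlib import Path

# Artifact names taken from the Step 6 entry in the list above.
EXPECTED_RUN_ARTIFACTS = ["config.yaml", "summary.txt", "weights", "logs", "tensorboard"]


def missing_artifacts(run_dir):
    """Return the expected artifact names that do not exist under run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_RUN_ARTIFACTS
            if not (run_dir / name).exists()]
```

An empty list means every expected file and directory is present for that fold.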
Troubleshooting Snapshot
- Verify install: pip list | grep ml-classifier
- CLI help: ml-train --help, ml-search --help, ml-visualise --help
- Validate config loads: python -c "from ml_src.core.config import load_config; print(load_config('configs/my_dataset_config.yaml'))"
- Check CUDA: python -c "import torch; print(torch.cuda.is_available())"