CLI Override System¶
For day-to-day usage examples, see the relevant steps in the Workflow Guide (especially training and inference). This reference focuses solely on how CLI flags map onto configuration entries.
Overview¶
Command-line arguments override YAML configuration, enabling quick experimentation without editing config files. This is perfect for hyperparameter search and rapid iteration.
How It Works¶
1. Load base config from YAML (ml_src/config_template.yaml)
2. Apply CLI argument overrides
3. Create run directory based on overrides
4. Save final config to runs/{run_name}/config.yaml
Override Mapping¶
| CLI Argument | Config Path | Type | Example |
|---|---|---|---|
--config |
(Specifies YAML file) | string | --config custom.yaml |
--dataset_name |
data.dataset_name |
string | --dataset_name my_dataset |
--data_dir |
data.data_dir |
string | --data_dir data/my_dataset |
--fold |
data.fold |
int | --fold 1 |
--batch_size |
training.batch_size |
int | --batch_size 32 |
--num_workers |
data.num_workers |
int | --num_workers 8 |
--num_epochs |
training.num_epochs |
int | --num_epochs 50 |
--lr |
optimizer.lr |
float | --lr 0.01 |
--momentum |
optimizer.momentum |
float | --momentum 0.95 |
--step_size |
scheduler.step_size |
int | --step_size 10 |
--gamma |
scheduler.gamma |
float | --gamma 0.5 |
--device |
training.device |
string | --device cuda:1 |
--resume |
(Special) | string | --resume runs/base/last.pt |
Usage Examples¶
Basic Override¶
Multiple Overrides¶
Custom Config File¶
# Use custom config
ml-train --config configs/my_config.yaml
# Custom config + overrides
ml-train --config configs/my_config.yaml --batch_size 64
Resume Training¶
# Resume from last checkpoint
ml-train --resume runs/base/last.pt
# Resume and train more epochs
ml-train --resume runs/base/last.pt --num_epochs 50
Run Directory Naming¶
Run directories are automatically named based on the dataset name, fold number, and hyperparameter overrides. This creates organized, self-documenting experiment folders.
Naming Rules¶
The format is: runs/{dataset_name}_{overrides}_fold_{fold_num}
No overrides:
With fold override:
With hyperparameter overrides:
ml-train --fold 0 --batch_size 32
# Creates: runs/hymenoptera_batch_32_fold_0/
ml-train --fold 0 --lr 0.01
# Creates: runs/hymenoptera_lr_0.01_fold_0/
Multiple overrides:
ml-train --fold 1 --batch_size 32 --num_epochs 50
# Creates: runs/hymenoptera_batch_32_epochs_50_fold_1/
ml-train --fold 1 --batch_size 32 --lr 0.01 --num_epochs 50
# Creates: runs/hymenoptera_batch_32_epochs_50_lr_0.01_fold_1/
Which Parameters Affect Run Name?¶
Included in run name (affect training results):
- batch_size
- num_epochs
- lr (learning rate)
Excluded from run name (don't affect results):
- num_workers (just loading speed)
- data_dir (just data location)
- device (just hardware selection)
Why This Design?¶
- Self-documenting: Run name tells you what changed
- Prevents overwrites: Different hyperparams → different folders
- Easy comparison: Compare runs by folder name
Common Workflows¶
1. Quick Experimentation¶
# Try different batch sizes
ml-train --batch_size 16
ml-train --batch_size 32
ml-train --batch_size 64
# Compare results in:
# - runs/batch_16/
# - runs/batch_32/
# - runs/batch_64/
2. Hyperparameter Search¶
# Grid search over LR and batch size
for lr in 0.001 0.01 0.1; do
for bs in 16 32 64; do
ml-train --lr $lr --batch_size $bs
done
done
# Creates organized folders:
# runs/batch_16_lr_0.001/
# runs/batch_16_lr_0.01/
# runs/batch_16_lr_0.1/
# runs/batch_32_lr_0.001/
# ... etc
3. Different Datasets¶
# Train on different datasets (same hyperparams)
ml-train --data_dir data/dataset1
ml-train --data_dir data/dataset2
ml-train --data_dir data/dataset3
# Note: All create runs/base/ (data_dir doesn't affect run name)
# Solution: Use different configs or add meaningful overrides
ml-train --data_dir data/dataset1 --num_epochs 25
ml-train --data_dir data/dataset2 --num_epochs 50
4. GPU Selection¶
# Use specific GPU (doesn't affect run name)
ml-train --device cuda:0
ml-train --device cuda:1
# Or via environment variable
CUDA_VISIBLE_DEVICES=1 ml-train
5. Resume and Extend¶
# Train 25 epochs
ml-train --num_epochs 25
# Saves to: runs/epochs_25/
# If not converged, resume and train more
ml-train --resume runs/epochs_25/last.pt --num_epochs 50
# Note: Continues training up to epoch 50 total
Advanced Usage¶
Config File + Overrides¶
You can combine custom configs with CLI overrides:
# configs/production.yaml has most settings
# Override just batch_size
ml-train --config configs/production.yaml --batch_size 128
Use case: Base config for a project, tweak individual params
Debugging with Overrides¶
# Quick debug run (small batch, few epochs, no workers)
ml-train --batch_size 2 --num_epochs 2 --num_workers 0
# Creates: runs/batch_2_epochs_2/
Full Override Example¶
ml-train \
--config ml_src/config_template.yaml \
--data_dir /mnt/datasets/imagenet \
--batch_size 64 \
--num_workers 8 \
--num_epochs 100 \
--lr 0.01 \
--momentum 0.9 \
--step_size 30 \
--gamma 0.1 \
--device cuda:0
# Creates: runs/batch_64_epochs_100_lr_0.01/
# With all your custom settings
Verifying Configuration¶
After training starts, check what config was actually used:
# View saved config
cat runs/{run_name}/config.yaml
# Or during training
cat runs/batch_32_lr_0.01/config.yaml
This shows the final merged config (YAML + overrides).
Best Practices¶
1. Start with Defaults¶
# First, train with base config
ml-train
# Then experiment with overrides
ml-train --lr 0.01
ml-train --batch_size 32
2. Systematic Exploration¶
# Test one parameter at a time
ml-train --lr 0.0001
ml-train --lr 0.001
ml-train --lr 0.01
# Then combine best values
ml-train --lr 0.001 --batch_size 32
3. Document Experiments¶
# Create a script to track experiments
cat > experiments.sh << 'EOF'
#!/bin/bash
# Experiment 1: Baseline
ml-train
# Experiment 2: Higher LR
ml-train --lr 0.01
# Experiment 3: Larger batch
ml-train --batch_size 32 --lr 0.002
EOF
chmod +x experiments.sh
./experiments.sh
4. Use Shell Scripts for Sweeps¶
# sweep.sh - Hyperparameter sweep
#!/bin/bash
learning_rates=(0.0001 0.001 0.01)
batch_sizes=(16 32 64)
for lr in "${learning_rates[@]}"; do
for bs in "${batch_sizes[@]}"; do
echo "Training with lr=$lr, batch_size=$bs"
ml-train --lr $lr --batch_size $bs
done
done
5. Compare Results¶
# After training multiple runs, compare
ls runs/
# Output:
# base/
# batch_32/
# batch_32_lr_0.01/
# epochs_50/
# lr_0.01/
# Use TensorBoard to compare
tensorboard --logdir runs/
Troubleshooting¶
Run Directory Already Exists¶
Problem:
Solution:
# Option 1: Delete old run
rm -rf runs/batch_32/
# Option 2: Use different hyperparams (creates new folder)
ml-train --batch_size 32 --lr 0.01
# Option 3: Resume training
ml-train --resume runs/batch_32/last.pt
Override Not Taking Effect¶
Problem: Changed CLI arg but config still shows old value
Solution:
# Check you're using correct argument name
ml-train --lr 0.01 # ✅ Correct
ml-train --learning_rate 0.01 # ❌ Wrong (not supported)
# Verify in saved config
cat runs/{run_name}/config.yaml
Complex Config Not Overridable¶
Problem: Want to override nested config (e.g., transforms)
Solution: CLI only supports common parameters. For complex changes:
# Option 1: Create custom config file
# configs/custom_transforms.yaml
# ... your changes ...
ml-train --config configs/custom_transforms.yaml
# Option 2: Modify base config temporarily
# Edit ml_src/config_template.yaml directly
Comparison with Other Approaches¶
CLI Overrides (Current)¶
✅ Quick experimentation
✅ No file editing
✅ Good for sweeps
❌ Limited to common params
Custom Config Files¶
✅ Full control
✅ Version controlled
✅ Shareable
❌ Requires file creation
Hybrid Approach (Recommended)¶
✅ Best of both worlds
Integration with Other Tools¶
TensorBoard¶
# Train multiple experiments
ml-train --lr 0.001
ml-train --lr 0.01
ml-train --lr 0.1
# Compare in TensorBoard
tensorboard --logdir runs/
# Open http://localhost:6006
# All runs visible with meaningful names
Shell Scripting¶
# Parallel training (if you have multiple GPUs)
ml-train --device cuda:0 --lr 0.001 &
ml-train --device cuda:1 --lr 0.01 &
wait
Python Scripting¶
import subprocess
learning_rates = [0.0001, 0.001, 0.01, 0.1]
for lr in learning_rates:
cmd = f"ml-train --lr {lr}"
subprocess.run(cmd, shell=True, check=True)
Summary¶
Key Takeaways: 1. CLI overrides make experimentation fast 2. Run names are auto-generated from overrides 3. Systematic sweeps are easy 4. Original config never modified 5. Final config always saved
Common Pattern:
# 1. Train baseline
ml-train
# 2. Experiment with one parameter
ml-train --lr 0.01
# 3. Combine best settings
ml-train --lr 0.01 --batch_size 32
# 4. Compare results
tensorboard --logdir runs/
Related Documentation¶
- Configuration Overview - How config system works
- All Configuration Options - What you can override
- Training Guide - Training workflows
- Hyperparameter Tuning - Systematic search
Quick Reference¶
# Most common overrides
--batch_size <int> # Batch size
--lr <float> # Learning rate
--num_epochs <int> # Training epochs
--data_dir <path> # Dataset location
--num_workers <int> # Data loading workers
--device <string> # GPU/CPU selection
--resume <path> # Resume training
# Examples
ml-train --batch_size 32
ml-train --lr 0.01 --num_epochs 50
ml-train --resume runs/base/last.pt