Data Configuration¶
Overview¶
The data section controls dataset loading and preprocessing parameters. These settings affect data loading speed, memory usage, and training throughput.
The framework uses index-based cross-validation splits to avoid data duplication while supporting k-fold cross-validation.
Configuration Parameters¶
dataset_name¶
- Type: String
- Default: 'hymenoptera'
- Description: Name of the dataset (used in run directory naming)
- Purpose: Distinguish between different datasets in run directories
Usage¶
CLI Override¶
Impact on Run Directory¶
The dataset name is automatically prepended to run directory names:
# Without dataset_name (old behavior):
# runs/base/
# runs/fold_1/
# runs/batch_32_lr_0.01/
# With dataset_name='hymenoptera':
# runs/hymenoptera_base/
# runs/hymenoptera_fold_1/
# runs/hymenoptera_batch_32_lr_0.01/
This prevents conflicts when training on multiple datasets.
Best Practices¶
- Use short, descriptive names (e.g., 'cifar10', 'imagenet', 'custom_birds')
- Avoid spaces and special characters (use underscores instead)
- Keep consistent across experiments on the same dataset
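The naming advice above can be enforced with a small helper. This is a minimal sketch, not part of the framework: `sanitize_dataset_name` is a hypothetical function that replaces spaces and special characters with underscores before the name is written into the config.

```python
import re


def sanitize_dataset_name(name: str) -> str:
    """Return a run-directory-safe version of a dataset name (hypothetical helper)."""
    # Collapse every run of characters other than letters, digits, and
    # underscores into a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    # Trim underscores left over at either end.
    cleaned = cleaned.strip("_")
    if not cleaned:
        raise ValueError(f"dataset name {name!r} has no usable characters")
    return cleaned


print(sanitize_dataset_name("Custom Birds (v2)"))
```

Names that already follow the guidance, such as 'cifar10', pass through unchanged.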
data_dir¶
- Type: String (path)
- Default: 'data/hymenoptera_data'
- Description: Path to dataset directory
- Purpose: Specify location of dataset with raw images and split index files
Usage¶
CLI Override¶
Common Paths¶
- Local data: data/my_dataset
- Network storage: /mnt/shared/datasets/my_dataset
- Absolute paths: /home/user/datasets/my_dataset
⚠️ CRITICAL: MANDATORY DIRECTORY STRUCTURE¶
This structure is NOT optional. The code WILL FAIL without it.
Your dataset MUST follow this exact hierarchy:
data_dir/
├── raw/ # REQUIRED: Original images organized by class
│ ├── class1/ # REQUIRED: One folder per class
│ │ ├── img1.jpg
│ │ ├── img2.jpg
│ │ └── ...
│ ├── class2/ # REQUIRED: Another class folder
│ │ ├── img1.jpg
│ │ └── ...
│ └── classN/ # REQUIRED: N class folders (N = num_classes)
│
└── splits/ # REQUIRED: Index files (generated by splitting.py)
    ├── test.txt # Single test set (SAME for all folds)
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    ├── fold_1_train.txt
    ├── fold_1_val.txt
    └── ...
Requirements (ALL MANDATORY)¶
- ✅ raw/ subdirectory: Contains all original images organized by class
- ✅ splits/ subdirectory: Contains index files generated by splitting.py
- ✅ Class folders: Each class in its own subdirectory under raw/
- ✅ Images in class folders: Images go directly inside class folders (no nested subdirectories)
- ✅ Matching num_classes: Number of class folders must equal model.num_classes in config
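Checking these requirements before launching a run can save a failed job. The following is a sketch under stated assumptions, not the framework's own validation: `check_dataset_layout` is a hypothetical helper that tests only what the list above states (raw/ and splits/ exist, class folder count matches, no nested folders inside class folders).

```python
from pathlib import Path


def check_dataset_layout(data_dir: str, num_classes: int) -> list[str]:
    """Return a list of layout problems; an empty list means none were found."""
    root = Path(data_dir)
    problems = []
    raw, splits = root / "raw", root / "splits"
    if not raw.is_dir():
        problems.append("missing raw/ subdirectory")
    else:
        class_dirs = sorted(p for p in raw.iterdir() if p.is_dir())
        if len(class_dirs) != num_classes:
            problems.append(
                f"found {len(class_dirs)} class folders, expected {num_classes}"
            )
        for class_dir in class_dirs:
            # Images must sit directly inside the class folder.
            nested = sorted(p.name for p in class_dir.iterdir() if p.is_dir())
            if nested:
                problems.append(f"{class_dir.name}/ contains nested folders: {nested}")
    if not splits.is_dir():
        problems.append("missing splits/ subdirectory")
    return problems
```

A call such as `check_dataset_layout("data/my_dataset", num_classes=2)` returns an empty list when the layout matches; the `num_classes` argument is expected to mirror `model.num_classes` from the config.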
Index-Based Cross-Validation¶
The framework uses index files to reference images without duplication:
- One copy of images in raw/ directory
- Lightweight text files in splits/ that list which images belong to each fold/split
- Reproducible - Same seed produces same splits
- Git-friendly - Index files are small and can be committed
For complete details on data organization, see: Data Preparation Guide
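To make the index-file idea concrete, here is a rough sketch of how one fold could be resolved to image paths. The file names follow the directory structure shown earlier; the file contents are an assumption, since this page does not specify the index format. The sketch assumes one image path per line, relative to raw/ (for example class1/img1.jpg). `read_index` and `load_fold` are hypothetical names, not the framework's API.

```python
from pathlib import Path


def read_index(index_file: Path) -> list[str]:
    """Read one index file, skipping blank lines."""
    with index_file.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def load_fold(data_dir: str, fold: int) -> dict[str, list[Path]]:
    """Resolve the train/val/test image paths for one fold."""
    root = Path(data_dir)
    splits = root / "splits"
    index_files = {
        "train": splits / f"fold_{fold}_train.txt",
        "val": splits / f"fold_{fold}_val.txt",
        "test": splits / "test.txt",  # single test set listed in the layout
    }
    # ASSUMPTION: each entry is a path relative to raw/, e.g. "class1/img1.jpg".
    return {
        name: [root / "raw" / entry for entry in read_index(path)]
        for name, path in index_files.items()
    }
```

Because only the text files differ between folds, every fold points back at the same single copy of the images under raw/.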
fold¶
- Type: Integer (≥ 0)
- Default: 0
- Description: Which cross-validation fold to use
- Purpose: Select specific fold for training
Usage¶
CLI Override¶
Behavior¶
Determines which index files to load from data_dir/splits/:
- fold: 0 → loads fold_0_train.txt, fold_0_val.txt, and the shared test.txt
- fold: 1 → loads fold_1_train.txt, fold_1_val.txt, and the shared test.txt
- fold: 2 → loads fold_2_train.txt, fold_2_val.txt, and the shared test.txt
Run Naming¶
Fold number is automatically added to run directory name:
ml-train --fold 0
# Creates: runs/hymenoptera_base_fold_0/
ml-train --fold 2 --batch_size 32
# Creates: runs/hymenoptera_batch_32_fold_2/
ml-train --fold 1 --lr 0.01 --num_epochs 50
# Creates: runs/hymenoptera_lr_0.01_epochs_50_fold_1/
Generating Folds¶
Before training, generate splits using splitting.py:
ml-split \
--raw_data data/my_dataset/raw \
--output data/my_dataset/splits \
--folds 5 \
--ratio 0.7 0.15 0.15 \
--seed 42
This creates:
- fold_0_*.txt, fold_1_*.txt, ..., fold_4_*.txt
- Each fold has a different train/val split; the test set in test.txt is the same for all folds
- Use different folds for cross-validation
Cross-Validation Workflow¶
# Train all 5 folds with same hyperparameters
for fold in {0..4}; do
  ml-train --fold "$fold" --batch_size 32 --lr 0.01 --num_epochs 100
done
# Average results across folds for final performance estimate
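The averaging step can be as simple as a mean and standard deviation over the per-fold scores. The sketch below uses placeholder numbers, not real results, and `summarize_folds` is a hypothetical helper; how the scores are read back from each run directory depends on the framework's output format, which this page does not describe.

```python
from statistics import mean, stdev


def summarize_folds(scores: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation of per-fold scores."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two folds")
    return mean(scores), stdev(scores)


# Placeholder accuracies for folds 0-4; these are NOT measured results.
fold_accuracy = [0.91, 0.93, 0.90, 0.94, 0.92]
avg, spread = summarize_folds(fold_accuracy)
print(f"accuracy: {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the mean shows how sensitive the result is to the particular train/val split.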
num_workers¶
- Type: Integer (≥ 0)
- Default: 4
- Description: Number of subprocesses for data loading
- Purpose: Parallelize data loading to prevent GPU starvation
Usage¶
CLI Override¶
Performance Guidance¶
| Setting | Use Case | Notes |
|---|---|---|
| 0 | Debugging, small datasets | Single-threaded, easier to debug |
| 2-4 | General use, consumer hardware | Good balance |
| 4-8 | High-performance training | For systems with many CPU cores |
| 8+ | Large-scale training | May see diminishing returns |
Choosing the Right Value¶
Factors to Consider:
- CPU core count (don't exceed available cores)
- RAM availability (each worker loads data in memory)
- Disk I/O (too many workers can bottleneck disk)
- Batch size (larger batches benefit more from parallel loading)
Finding Optimal Value:
- Start with 4 (default)
- Monitor GPU utilization: watch -n 1 nvidia-smi
- If GPU utilization < 90%, increase num_workers
- If GPU utilization near 100%, value is good
- If system becomes unresponsive, decrease num_workers
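The "don't exceed available cores" rule can be applied automatically before a value is written to the config. This is a minimal sketch: `clamp_num_workers` and its `reserve_cores` parameter are hypothetical, and keeping one core free for the main training process is a common heuristic rather than something the framework requires.

```python
import os


def clamp_num_workers(requested: int, reserve_cores: int = 1) -> int:
    """Cap a requested worker count at the CPU cores available."""
    if requested < 0:
        raise ValueError("num_workers must be >= 0")
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    # Keep some cores free for the main training process when possible.
    usable = max(cores - reserve_cores, 0)
    return min(requested, usable)


print(clamp_num_workers(4))
```

The result is only an upper bound. RAM, disk I/O, and observed GPU utilization still decide whether a lower value works better.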
Performance Impact¶
num_workers=0: [####------] 40% GPU utilization (GPU waiting for data)
num_workers=2: [#######---] 70% GPU utilization
num_workers=4: [##########] 98% GPU utilization (optimal)
num_workers=8: [##########] 98% GPU utilization (no improvement, wastes CPU)
Troubleshooting¶
Problem: GPU utilization low (<70%)
- Solution: Increase num_workers
- GPU is starving for data
Problem: System unresponsive, high RAM usage
- Solution: Decrease num_workers
- Too many workers loading data in parallel
Problem: "Too many open files" error
- Solution: Increase system file descriptor limit
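On Unix-like systems, the per-process limit can also be raised from Python before data loading starts, which has a similar effect to running `ulimit -n` in the shell. This is a sketch only: `raise_open_file_limit` is a hypothetical helper, the default target of 4096 is an arbitrary example value, and the standard-library `resource` module is not available on Windows.

```python
import resource  # Unix-only standard library module


def raise_open_file_limit(target: int = 4096) -> int:
    """Raise this process's soft open-file limit toward `target`.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to change
    # The soft limit cannot exceed the hard limit without extra privileges.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

The change applies to the current process and to processes it starts afterwards; raising the hard limit itself is a system administration task outside the scope of this sketch.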
Special Cases¶
CPU-only training:
- Use 2-4 workers (lower than GPU training)
- CPU is handling both training and data loading
Small datasets (< 1000 images):
- Use 0-2 workers
- Overhead of multiprocessing not worth it
Large images (> 2MB each):
- Use 2-4 workers (lower than normal)
- Each worker consumes more memory
Network/cloud storage:
- Use 2-4 workers (lower than local disk)
- Network I/O may be bottleneck
Complete Examples¶
Example 1: Local Dataset with CV¶
data:
  dataset_name: 'my_dataset'
  data_dir: 'data/my_dataset' # Contains raw/ and splits/
  fold: 0
  num_workers: 4
# Generate splits first (one time)
ml-split --raw_data data/my_dataset/raw --output data/my_dataset/splits --folds 5
# Train fold 0
ml-train --fold 0
# Train fold 1
ml-train --fold 1
Example 2: Network Storage¶
data:
  dataset_name: 'imagenet'
  data_dir: '/mnt/nfs/shared_datasets/imagenet' # Contains raw/ and splits/
  fold: 0
  num_workers: 2 # Lower due to network I/O
Example 3: Multiple Datasets¶
# config_hymenoptera.yaml
data:
  dataset_name: 'hymenoptera'
  data_dir: 'data/hymenoptera_data'
  fold: 0
# config_cifar.yaml
data:
  dataset_name: 'cifar10'
  data_dir: 'data/cifar10'
  fold: 0
# Train on different datasets
ml-train --config config_hymenoptera.yaml # runs/hymenoptera_base/
ml-train --config config_cifar.yaml # runs/cifar10_base/
Example 4: Debugging¶
data:
  dataset_name: 'test'
  data_dir: 'data/test_dataset' # Contains raw/ and splits/
  fold: 0
  num_workers: 0 # Single-threaded for easier debugging
Best Practices¶
- Use descriptive dataset names - Makes run directories self-documenting
- Generate splits once - Reuse same splits for reproducibility
- Start with defaults (num_workers: 4) - Monitor GPU utilization during training
- Commit splits to git - Index files are small and ensure reproducibility
- Document split generation - Save the exact splitting.py command used
- Keep raw/ immutable - Never modify after generating splits
Related Configuration¶
- Data Preparation Guide - Complete guide to organizing datasets
- Training Configuration - Related training parameters
- CLI Overrides - How to override via command line
- Performance Tuning - Optimize training speed