Data Configuration¶

Overview¶

The data section controls dataset loading and preprocessing parameters. These settings affect data loading speed, memory usage, and training throughput.

The framework uses index-based cross-validation splits to avoid data duplication while supporting k-fold cross-validation.

Configuration Parameters¶

data:
  dataset_name: <string>
  data_dir: <string>
  fold: <int>
  num_workers: <int>
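
As a rough illustration of how the four keys and their documented defaults fit together, the sketch below models the data section as a Python dataclass. The `DataConfig` class and its validation are hypothetical and are not taken from the framework; only the key names, the defaults, and the "≥ 0" constraints come from this page.

```python
from dataclasses import dataclass


@dataclass
class DataConfig:
    """Hypothetical model of the `data` section; not the framework's own class."""

    dataset_name: str = "hymenoptera"
    data_dir: str = "data/hymenoptera_data"
    fold: int = 0
    num_workers: int = 4

    def __post_init__(self) -> None:
        # Both integer settings are documented as >= 0.
        if self.fold < 0:
            raise ValueError(f"fold must be >= 0, got {self.fold}")
        if self.num_workers < 0:
            raise ValueError(f"num_workers must be >= 0, got {self.num_workers}")


# A partial `data:` mapping (as parsed from YAML) overrides only the keys it names.
cfg = DataConfig(**{"fold": 2, "num_workers": 0})
print(cfg)
```

Keys left out of the mapping keep their defaults, which mirrors how a config file that sets only some data keys would be expected to behave.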

`dataset_name`¶

Type: String
Default: 'hymenoptera'
Description: Name of the dataset (used in run directory naming)
Purpose: Distinguish between different datasets in run directories

Usage¶

data:
  dataset_name: 'hymenoptera'

CLI Override¶

ml-train --dataset_name my_dataset

Impact on Run Directory¶

The dataset name is automatically prepended to run directory names:

# Without dataset_name (old behavior):
# runs/base/
# runs/fold_1/
# runs/batch_32_lr_0.01/

# With dataset_name='hymenoptera':
# runs/hymenoptera_base/
# runs/hymenoptera_fold_1/
# runs/hymenoptera_batch_32_lr_0.01/

This prevents conflicts when training on multiple datasets.

Best Practices¶

Use short, descriptive names (e.g., 'cifar10', 'imagenet', 'custom_birds')
Avoid spaces and special characters (use underscores instead)
Keep consistent across experiments on the same dataset
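
The naming advice above could be applied automatically with a small helper like the one below. `sanitize_dataset_name` is a hypothetical function, not part of the framework, and lowercasing is an extra assumption made here to match the example names.

```python
import re


def sanitize_dataset_name(name: str) -> str:
    """Hypothetical helper: turn an arbitrary label into a directory-safe name."""
    # Replace each run of characters other than letters, digits, and
    # underscores with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    # Drop leading/trailing underscores; lowercase (assumed convention).
    return cleaned.strip("_").lower()


print(sanitize_dataset_name("Custom Birds (v2)"))  # custom_birds_v2
```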

`data_dir`¶

Type: String (path)
Default: 'data/hymenoptera_data'
Description: Path to dataset directory
Purpose: Specify location of dataset with raw images and split index files

Usage¶

data:
  data_dir: 'data/hymenoptera_data'

CLI Override¶

ml-train --data_dir data/my_dataset
ml-train --data_dir /mnt/shared/datasets/imagenet

Common Paths¶

Local data: data/my_dataset
Network storage: /mnt/shared/datasets/my_dataset
Absolute paths: /home/user/datasets/my_dataset

⚠️ CRITICAL: MANDATORY DIRECTORY STRUCTURE¶

This structure is NOT optional. The code WILL FAIL without it.

Your dataset MUST follow this exact hierarchy:

data_dir/
├── raw/                # REQUIRED: Original images organized by class
│   ├── class1/        # REQUIRED: One folder per class
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   ├── class2/        # REQUIRED: Another class folder
│   │   ├── img1.jpg
│   │   └── ...
│   └── classN/        # REQUIRED: N class folders (N = num_classes)
│
└── splits/             # REQUIRED: Index files (generated by splitting.py)
    ├── test.txt           # Single test set (SAME for all folds)
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    ├── fold_1_train.txt
    ├── fold_1_val.txt
    └── ...

Requirements (ALL MANDATORY)¶

✅ raw/ subdirectory: Contains all original images organized by class
✅ splits/ subdirectory: Contains index files generated by splitting.py
✅ Class folders: Each class in its own subdirectory under raw/
✅ Images in class folders: Images go directly inside class folders (no nested subdirectories)
✅ Matching num_classes: Number of class folders must equal model.num_classes in config
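
The requirements above can be checked before a training run with a short script. This is a minimal sketch, not the framework's own validation: `check_dataset_layout` is a hypothetical helper, and it checks only the folder structure, not the contents of the index files.

```python
from pathlib import Path


def check_dataset_layout(data_dir: str, num_classes: int) -> list[str]:
    """Return a list of layout problems; an empty list means the checks passed."""
    root = Path(data_dir)
    raw, splits = root / "raw", root / "splits"
    problems = []

    if not splits.is_dir():
        problems.append("missing splits/ subdirectory")
    if not raw.is_dir():
        problems.append("missing raw/ subdirectory")
        return problems  # cannot inspect class folders without raw/

    class_dirs = sorted(p for p in raw.iterdir() if p.is_dir())
    if len(class_dirs) != num_classes:
        problems.append(
            f"found {len(class_dirs)} class folders, expected {num_classes}"
        )
    for class_dir in class_dirs:
        # Images must sit directly inside the class folder.
        nested = sorted(p.name for p in class_dir.iterdir() if p.is_dir())
        if nested:
            problems.append(f"{class_dir.name}/ contains nested folders: {nested}")
    return problems


for problem in check_dataset_layout("data/hymenoptera_data", num_classes=2):
    print("layout problem:", problem)
```

Here `num_classes=2` stands in for whatever model.num_classes is set to in the config.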

Index-Based Cross-Validation¶

The framework uses index files to reference images without duplication:

- One copy of the images, in the raw/ directory
- Lightweight text files in splits/ that list which images belong to each fold/split
- Reproducible: the same seed produces the same splits
- Git-friendly: index files are small and can be committed
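
To make the idea concrete, the sketch below resolves the entries of one index file against raw/. The file format is an assumption: this page does not specify it, so the code assumes one image path per line, relative to raw/ (for example `class1/img1.jpg`). `read_index` is a hypothetical helper, not the framework's loader.

```python
from pathlib import Path


def read_index(index_file: str, raw_dir: str) -> list[tuple[Path, str]]:
    """Resolve each listed image to (full path, class name).

    Assumes one path per line, relative to raw/, such as `class1/img1.jpg`.
    """
    samples = []
    for line in Path(index_file).read_text().splitlines():
        rel = line.strip()
        if not rel:
            continue  # skip blank lines
        rel_path = Path(rel)
        # The first path component is the class folder name.
        samples.append((Path(raw_dir) / rel_path, rel_path.parts[0]))
    return samples
```

Because the index only names files, every fold can point at the same images in raw/ without copying them.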

For complete details on data organization, see: Data Preparation Guide

`fold`¶

Type: Integer (≥ 0)
Default: 0
Description: Which cross-validation fold to use
Purpose: Select specific fold for training

Usage¶

data:
  fold: 0

CLI Override¶

ml-train --fold 0
ml-train --fold 1
ml-train --fold 2

Behavior¶

Determines which index files to load from data_dir/splits/:

- fold: 0 → loads fold_0_train.txt and fold_0_val.txt
- fold: 1 → loads fold_1_train.txt and fold_1_val.txt
- fold: 2 → loads fold_2_train.txt and fold_2_val.txt

Every fold also uses the single shared test.txt; the directory structure shown under data_dir has no per-fold test files.
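
The mapping from a fold number to index files can be sketched as below. It follows the directory layout shown under data_dir, where test.txt is shared across folds; `split_files` is a hypothetical helper and not the framework's API.

```python
from pathlib import Path


def split_files(data_dir: str, fold: int) -> dict[str, Path]:
    """Return the index files that a given fold would read."""
    splits = Path(data_dir) / "splits"
    return {
        "train": splits / f"fold_{fold}_train.txt",
        "val": splits / f"fold_{fold}_val.txt",
        "test": splits / "test.txt",  # shared by every fold
    }


for name, path in split_files("data/hymenoptera_data", fold=1).items():
    print(f"{name}: {path}")
```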

Run Naming¶

Fold number is automatically added to run directory name:

ml-train --fold 0
# Creates: runs/hymenoptera_base_fold_0/

ml-train --fold 2 --batch_size 32
# Creates: runs/hymenoptera_batch_32_fold_2/

ml-train --fold 1 --lr 0.01 --num_epochs 50
# Creates: runs/hymenoptera_lr_0.01_epochs_50_fold_1/
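
The naming pattern in these three examples can be reproduced with a few lines of Python. This is a sketch inferred from the examples only: `build_run_name` is hypothetical, and the short labels (`batch`, `lr`, `epochs`) are passed in by the caller because this page does not document how the framework abbreviates parameter names.

```python
def build_run_name(dataset_name, overrides=(), fold=None):
    """Compose a run directory name as <dataset>_<overrides or 'base'>[_fold_N]."""
    parts = [dataset_name]
    if overrides:
        # One "<label>_<value>" segment per overridden parameter, in order.
        parts += [f"{label}_{value}" for label, value in overrides]
    else:
        parts.append("base")  # no hyperparameter overrides
    if fold is not None:
        parts.append(f"fold_{fold}")
    return "_".join(parts)


print(build_run_name("hymenoptera", fold=0))
print(build_run_name("hymenoptera", [("batch", 32)], fold=2))
print(build_run_name("hymenoptera", [("lr", 0.01), ("epochs", 50)], fold=1))
```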

Generating Folds¶

Before training, generate splits using splitting.py:

ml-split \
  --raw_data data/my_dataset/raw \
  --output data/my_dataset/splits \
  --folds 5 \
  --ratio 0.7 0.15 0.15 \
  --seed 42

This creates:

- fold_0_*.txt, fold_1_*.txt, ..., fold_4_*.txt, each with a different train/val split
- test.txt, the single test set shared by all folds

Train on each fold in turn for cross-validation.
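
For intuition about why a fixed seed gives reproducible folds, here is a minimal seeded k-fold split with one held-out test set. It is not the logic of splitting.py: `make_splits` is a hypothetical function, and the validation size here follows from the fold count rather than from `--ratio`.

```python
import random


def make_splits(items, folds=5, test_fraction=0.15, seed=42):
    """Illustrative seeded k-fold split with a single shared test set."""
    shuffled = sorted(items)               # fixed starting order
    random.Random(seed).shuffle(shuffled)  # same seed -> same permutation
    n_test = int(len(shuffled) * test_fraction)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    result = {"test": test}
    for k in range(folds):
        val = rest[k::folds]  # every `folds`-th remaining item, offset by k
        val_set = set(val)
        result[f"fold_{k}_train"] = [x for x in rest if x not in val_set]
        result[f"fold_{k}_val"] = val
    return result


splits = make_splits([f"img{i}.jpg" for i in range(100)])
print({name: len(members) for name, members in splits.items()})
```

Sorting before shuffling removes any dependence on directory listing order, so the same seed and the same file names always give the same index lists.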

Cross-Validation Workflow¶

# Train all 5 folds with same hyperparameters
for fold in {0..4}; do
  ml-train --fold $fold --batch_size 32 --lr 0.01 --num_epochs 100
done

# Average results across folds for final performance estimate
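
The averaging step might look like the sketch below. The accuracies are made-up placeholders, not real results, and how per-fold metrics are read back from the run directories is left out because this page does not describe it.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)


# Hypothetical validation accuracies, one per fold (placeholders only).
fold_accuracy = {0: 0.91, 1: 0.89, 2: 0.93, 3: 0.90, 4: 0.92}

avg, spread = summarize(list(fold_accuracy.values()))
print(f"accuracy over {len(fold_accuracy)} folds: {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the mean shows how sensitive the result is to the choice of fold.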

`num_workers`¶

Type: Integer (≥ 0)
Default: 4
Description: Number of subprocesses for data loading
Purpose: Parallelize data loading to prevent GPU starvation

Usage¶

data:
  num_workers: 4

CLI Override¶

ml-train --num_workers 8

Performance Guidance¶

Setting	Use Case	Notes
`0`	Debugging, small datasets	Single-threaded, easier to debug
`2-4`	General use, consumer hardware	Good balance
`4-8`	High-performance training	For systems with many CPU cores
`8+`	Large-scale training	May see diminishing returns

Choosing the Right Value¶

Factors to Consider:

- CPU core count (don't exceed available cores)
- RAM availability (each worker loads data in memory)
- Disk I/O (too many workers can bottleneck disk)
- Batch size (larger batches benefit more from parallel loading)

Finding Optimal Value:

Start with 4 (default)
Monitor GPU utilization: watch -n 1 nvidia-smi
If GPU utilization < 90%, increase num_workers
If GPU utilization near 100%, value is good
If system becomes unresponsive, decrease num_workers
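
The "don't exceed available cores" rule can be enforced in code before the value reaches the data loader. This is a small sketch: `cap_num_workers` is a hypothetical helper, and reserving one core for the main training process is an assumption rather than framework behavior.

```python
import os


def cap_num_workers(requested: int, reserve: int = 1) -> int:
    """Clamp a requested worker count to the CPU cores available.

    `reserve` cores are left for the main training process (an assumption).
    """
    if requested < 0:
        raise ValueError("num_workers must be >= 0")
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return min(requested, max(cores - reserve, 0))


print("num_workers to use:", cap_num_workers(8))
```

A cap like this only sets an upper bound; the GPU utilization check above is still what decides the best value within it.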

Performance Impact¶

num_workers=0:  [####------] 40% GPU utilization (GPU waiting for data)
num_workers=2:  [#######---] 70% GPU utilization
num_workers=4:  [##########] 98% GPU utilization (optimal)
num_workers=8:  [##########] 98% GPU utilization (no improvement, wastes CPU)

Troubleshooting¶

Problem: GPU utilization low (<70%)

- Solution: Increase num_workers
- GPU is starving for data

Problem: System unresponsive, high RAM usage

- Solution: Decrease num_workers
- Too many workers loading data in parallel

Problem: "Too many open files" error

- Solution: Increase system file descriptor limit

ulimit -n 4096
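
The same limit can be inspected and raised from inside a Python process with the standard-library `resource` module (Unix only). The sketch below is an alternative to the shell command, not something the framework is known to do; `raise_open_file_limit` is a hypothetical helper.

```python
import resource


def raise_open_file_limit(target: int = 4096) -> int:
    """Raise the soft open-file limit toward `target`; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    # The soft limit can never exceed the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft  # the platform refused the change; keep the old limit
    return new_soft


print("soft open-file limit:", raise_open_file_limit())
```

A change made this way applies only to the current process and the workers it starts, whereas `ulimit -n` applies to the shell session and everything launched from it.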

Special Cases¶

CPU-only training:

- Use 2-4 workers (lower than GPU training)
- CPU is handling both training and data loading

Small datasets (< 1000 images):

- Use 0-2 workers
- Overhead of multiprocessing not worth it

Large images (> 2MB each):

- Use 2-4 workers (lower than normal)
- Each worker consumes more memory

Network/cloud storage:

- Use 2-4 workers (lower than local disk)
- Network I/O may be bottleneck

Complete Examples¶

Example 1: Local Dataset with CV¶

data:
  dataset_name: 'my_dataset'
  data_dir: 'data/my_dataset'  # Contains raw/ and splits/
  fold: 0
  num_workers: 4

# Generate splits first (one time)
ml-split --raw_data data/my_dataset/raw --output data/my_dataset/splits --folds 5

# Train fold 0
ml-train --fold 0

# Train fold 1
ml-train --fold 1

Example 2: Network Storage¶

data:
  dataset_name: 'imagenet'
  data_dir: '/mnt/nfs/shared_datasets/imagenet'  # Contains raw/ and splits/
  fold: 0
  num_workers: 2  # Lower due to network I/O

Example 3: Multiple Datasets¶

# config_hymenoptera.yaml
data:
  dataset_name: 'hymenoptera'
  data_dir: 'data/hymenoptera_data'
  fold: 0

# config_cifar.yaml
data:
  dataset_name: 'cifar10'
  data_dir: 'data/cifar10'
  fold: 0

# Train on different datasets
ml-train --config config_hymenoptera.yaml  # runs/hymenoptera_base/
ml-train --config config_cifar.yaml        # runs/cifar10_base/

Example 4: Debugging¶

data:
  dataset_name: 'test'
  data_dir: 'data/test_dataset'  # Contains raw/ and splits/
  fold: 0
  num_workers: 0  # Single-threaded for easier debugging

Best Practices¶

Use descriptive dataset names - Makes run directories self-documenting
Generate splits once - Reuse same splits for reproducibility
Start with defaults (num_workers: 4)
Monitor GPU utilization during training
Commit splits to git - Index files are small and ensure reproducibility
Document split generation - Save the exact splitting.py command used
Keep raw/ immutable - Never modify after generating splits

Data Preparation Guide - Complete guide to organizing datasets
Training Configuration - Related training parameters
CLI Overrides - How to override via command line
Performance Tuning - Optimize training speed
