Data Configuration

Overview

The data section controls dataset loading and preprocessing parameters. These settings affect data loading speed, memory usage, and training throughput.

The framework uses index-based cross-validation splits to avoid data duplication while supporting k-fold cross-validation.

Configuration Parameters

data:
  dataset_name: <string>
  data_dir: <string>
  fold: <int>
  num_workers: <int>
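
As a reading aid, the four keys and their documented defaults can be mirrored in a small Python dataclass. This is an illustrative sketch only: `DataConfig` and its validation are hypothetical and are not the framework's own API.

```python
from dataclasses import dataclass


@dataclass
class DataConfig:
    """Hypothetical mirror of the `data:` section, using the documented defaults."""

    dataset_name: str = "hymenoptera"
    data_dir: str = "data/hymenoptera_data"
    fold: int = 0
    num_workers: int = 4

    def __post_init__(self) -> None:
        # Both integer settings are documented as >= 0.
        if self.fold < 0:
            raise ValueError(f"fold must be >= 0, got {self.fold}")
        if self.num_workers < 0:
            raise ValueError(f"num_workers must be >= 0, got {self.num_workers}")


# Build a config from a parsed `data:` mapping (for example, the result of a YAML load).
cfg = DataConfig(**{"dataset_name": "my_dataset", "fold": 2})
print(cfg)
```

Keys missing from the mapping fall back to the defaults listed in the parameter sections below.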

dataset_name

  • Type: String
  • Default: 'hymenoptera'
  • Description: Name of the dataset (used in run directory naming)
  • Purpose: Distinguish between different datasets in run directories

Usage

data:
  dataset_name: 'hymenoptera'

CLI Override

ml-train --dataset_name my_dataset

Impact on Run Directory

The dataset name is automatically prepended to run directory names:

# Without dataset_name (old behavior):
# runs/base/
# runs/fold_1/
# runs/batch_32_lr_0.01/

# With dataset_name='hymenoptera':
# runs/hymenoptera_base/
# runs/hymenoptera_fold_1/
# runs/hymenoptera_batch_32_lr_0.01/

This prevents conflicts when training on multiple datasets.

Best Practices

  • Use short, descriptive names (e.g., 'cifar10', 'imagenet', 'custom_birds')
  • Avoid spaces and special characters (use underscores instead)
  • Keep consistent across experiments on the same dataset
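
The naming advice above can be enforced with a small helper before a name reaches the config. This is a minimal sketch; `normalize_dataset_name` is a hypothetical function, and the framework itself may or may not sanitize names.

```python
import re


def normalize_dataset_name(name: str) -> str:
    """Replace runs of characters other than letters, digits, and '_' with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    cleaned = cleaned.strip("_")  # drop underscores left at either end
    if not cleaned:
        raise ValueError(f"dataset name {name!r} has no usable characters")
    return cleaned


print(normalize_dataset_name("Custom Birds (v2)"))
```

Names that already follow the convention, such as `cifar10`, pass through unchanged.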

data_dir

  • Type: String (path)
  • Default: 'data/hymenoptera_data'
  • Description: Path to dataset directory
  • Purpose: Specify location of dataset with raw images and split index files

Usage

data:
  data_dir: 'data/hymenoptera_data'

CLI Override

ml-train --data_dir data/my_dataset
ml-train --data_dir /mnt/shared/datasets/imagenet

Common Paths

  • Local data: data/my_dataset
  • Network storage: /mnt/shared/datasets/my_dataset
  • Absolute paths: /home/user/datasets/my_dataset

⚠️ CRITICAL: MANDATORY DIRECTORY STRUCTURE

This structure is NOT optional. The code WILL FAIL without it.

Your dataset MUST follow this exact hierarchy:

data_dir/
├── raw/                # REQUIRED: Original images organized by class
│   ├── class1/        # REQUIRED: One folder per class
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   ├── class2/        # REQUIRED: Another class folder
│   │   ├── img1.jpg
│   │   └── ...
│   └── classN/        # REQUIRED: N class folders (N = num_classes)
└── splits/             # REQUIRED: Index files (generated by splitting.py)
    ├── test.txt           # Single test set (SAME for all folds)
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    ├── fold_1_train.txt
    ├── fold_1_val.txt
    └── ...

Requirements (ALL MANDATORY)

  1. raw/ subdirectory: Contains all original images organized by class
  2. splits/ subdirectory: Contains index files generated by splitting.py
  3. Class folders: Each class in its own subdirectory under raw/
  4. Images in class folders: Images go directly inside class folders (no nested subdirectories)
  5. Matching num_classes: Number of class folders must equal model.num_classes in config
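
The requirements above can be checked before a training run. The following is a sketch using only `pathlib`; `check_dataset_layout` is a hypothetical helper, not a framework function, and it checks directories only (it does not inspect the index files themselves).

```python
from pathlib import Path


def check_dataset_layout(data_dir: str, num_classes: int) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks valid."""
    root = Path(data_dir)
    raw, splits = root / "raw", root / "splits"
    problems = []

    if not splits.is_dir():
        problems.append(f"missing directory: {splits}")
    if not raw.is_dir():
        problems.append(f"missing directory: {raw}")
        return problems  # nothing more to check without raw/

    # Requirement 3 and 5: one folder per class, count equal to num_classes.
    class_dirs = sorted(p for p in raw.iterdir() if p.is_dir())
    if len(class_dirs) != num_classes:
        problems.append(
            f"found {len(class_dirs)} class folders, expected {num_classes}"
        )

    # Requirement 4: images sit directly in the class folder, no nesting.
    for class_dir in class_dirs:
        nested = sorted(p.name for p in class_dir.iterdir() if p.is_dir())
        if nested:
            problems.append(f"{class_dir.name}/ contains nested subdirectories: {nested}")
    return problems


# Example call; the path and class count are placeholders.
for problem in check_dataset_layout("data/my_dataset", num_classes=2):
    print(problem)
```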

Index-Based Cross-Validation

The framework uses index files to reference images without duplication:

  • One copy of images in the raw/ directory
  • Lightweight text files in splits/ that list which images belong to each fold/split
  • Reproducible: the same seed produces the same splits
  • Git-friendly: index files are small and can be committed
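
To make the idea concrete, here is a sketch of how an index file could be resolved to image paths. The file format is an assumption: this page does not specify it, so the sketch assumes one image per line, written relative to raw/ as `<class_name>/<file_name>`. `read_index` is a hypothetical helper.

```python
from pathlib import Path


def read_index(index_file: str, raw_dir: str) -> list[tuple[Path, str]]:
    """Return (image_path, class_name) pairs listed in an index file.

    ASSUMPTION: one image per line, relative to raw/, as `<class_name>/<file_name>`.
    The real format written by splitting.py may differ.
    """
    samples = []
    for line in Path(index_file).read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        rel = Path(entry)
        # The first path component is taken as the class name.
        samples.append((Path(raw_dir) / rel, rel.parts[0]))
    return samples
```

Because only paths are stored, every fold can reference the same files in raw/ without copying them.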

For complete details on data organization, see: Data Preparation Guide


fold

  • Type: Integer (≥ 0)
  • Default: 0
  • Description: Which cross-validation fold to use
  • Purpose: Select specific fold for training

Usage

data:
  fold: 0

CLI Override

ml-train --fold 0
ml-train --fold 1
ml-train --fold 2

Behavior

Determines which index files to load from data_dir/splits/:

  • fold: 0 → loads fold_0_train.txt and fold_0_val.txt
  • fold: 1 → loads fold_1_train.txt and fold_1_val.txt
  • fold: 2 → loads fold_2_train.txt and fold_2_val.txt

The directory structure above lists a single test.txt that is the same for all folds, so every fold is evaluated against that shared file rather than a per-fold test file.
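
The mapping from fold number to file names can be sketched as below. It follows the directory tree shown under data_dir (per-fold train/val files plus one shared test.txt); `split_files` is a hypothetical helper, not the framework's loader.

```python
from pathlib import Path


def split_files(data_dir: str, fold: int) -> dict[str, Path]:
    """Return the index files for one fold, following the documented layout."""
    if fold < 0:
        raise ValueError(f"fold must be >= 0, got {fold}")
    splits = Path(data_dir) / "splits"
    return {
        "train": splits / f"fold_{fold}_train.txt",
        "val": splits / f"fold_{fold}_val.txt",
        "test": splits / "test.txt",  # shared by every fold in the directory tree
    }


for name, path in split_files("data/my_dataset", fold=1).items():
    print(name, path)
```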

Run Naming

Fold number is automatically added to run directory name:

ml-train --fold 0
# Creates: runs/hymenoptera_base_fold_0/

ml-train --fold 2 --batch_size 32
# Creates: runs/hymenoptera_batch_32_fold_2/

ml-train --fold 1 --lr 0.01 --num_epochs 50
# Creates: runs/hymenoptera_lr_0.01_epochs_50_fold_1/

Generating Folds

Before training, generate splits with splitting.py, invoked here through the ml-split command:

ml-split \
  --raw_data data/my_dataset/raw \
  --output data/my_dataset/splits \
  --folds 5 \
  --ratio 0.7 0.15 0.15 \
  --seed 42

This creates:

  • fold_0_*.txt, fold_1_*.txt, ..., fold_4_*.txt
  • A different train/val split for each fold, alongside the test.txt that the directory structure above shows as shared by all folds

Use different folds for cross-validation.
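
For intuition about why a fixed seed makes splits reproducible, the sketch below builds a shared test set and rotating validation folds from a seeded shuffle. It is not the algorithm used by splitting.py: `make_folds` is hypothetical, it is not stratified by class, and it derives the train/val sizes from the number of folds instead of the `--ratio` values.

```python
import random


def make_folds(
    items: list[str], folds: int, test_fraction: float, seed: int
) -> tuple[list[str], list[tuple[list[str], list[str]]]]:
    """Return (test, [(train, val), ...]) with one test set shared by all folds."""
    rng = random.Random(seed)   # local RNG: the same seed gives the same shuffle
    shuffled = sorted(items)    # sort first so the input order does not matter
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    result = []
    for k in range(folds):
        val = rest[k::folds]    # every folds-th item, offset by k
        val_set = set(val)
        train = [x for x in rest if x not in val_set]
        result.append((train, val))
    return test, result


images = [f"class1/img{i}.jpg" for i in range(100)]
test, cv = make_folds(images, folds=5, test_fraction=0.15, seed=42)
print(len(test), [(len(tr), len(va)) for tr, va in cv])
```

Each remaining image lands in exactly one validation fold, and rerunning with the same seed yields identical lists.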

Cross-Validation Workflow

# Train all 5 folds with same hyperparameters
for fold in {0..4}; do
  ml-train --fold $fold --batch_size 32 --lr 0.01 --num_epochs 100
done

# Average results across folds for final performance estimate
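
The averaging step mentioned in the last comment can be sketched with the standard library. The accuracy values below are placeholders for illustration, not results from the framework; how per-fold metrics are stored or retrieved is not covered on this page.

```python
from statistics import mean, stdev

# Placeholder validation accuracies, one per fold (illustrative values only).
fold_accuracy = [0.91, 0.89, 0.93, 0.90, 0.92]

avg = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation across folds
print(f"accuracy: {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the mean shows how sensitive the result is to the choice of fold.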

num_workers

  • Type: Integer (≥ 0)
  • Default: 4
  • Description: Number of subprocesses for data loading
  • Purpose: Parallelize data loading to prevent GPU starvation

Usage

data:
  num_workers: 4

CLI Override

ml-train --num_workers 8

Performance Guidance

Setting   Use Case                          Notes
0         Debugging, small datasets         Single-threaded, easier to debug
2-4       General use, consumer hardware    Good balance
4-8       High-performance training         For systems with many CPU cores
8+        Large-scale training              May see diminishing returns

Choosing the Right Value

Factors to Consider:

  • CPU core count (don't exceed available cores)
  • RAM availability (each worker loads data in memory)
  • Disk I/O (too many workers can bottleneck disk)
  • Batch size (larger batches benefit more from parallel loading)

Finding Optimal Value:

  1. Start with 4 (default)
  2. Monitor GPU utilization: watch -n 1 nvidia-smi
  3. If GPU utilization < 90%, increase num_workers
  4. If GPU utilization near 100%, value is good
  5. If system becomes unresponsive, decrease num_workers

Performance Impact

num_workers=0:  [####------] 40% GPU utilization (GPU waiting for data)
num_workers=2:  [#######---] 70% GPU utilization
num_workers=4:  [##########] 98% GPU utilization (optimal)
num_workers=8:  [##########] 98% GPU utilization (no improvement, wastes CPU)

Troubleshooting

Problem: GPU utilization low (<70%)

  • Solution: Increase num_workers
  • GPU is starving for data

Problem: System unresponsive, high RAM usage

  • Solution: Decrease num_workers
  • Too many workers loading data in parallel

Problem: "Too many open files" error

  • Solution: Increase system file descriptor limit

ulimit -n 4096
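
The same limit can also be inspected and raised from inside a Python process with the standard-library `resource` module (Unix only). This is a sketch under that assumption; `raise_open_file_limit` is a hypothetical helper, and this page does not say whether the framework adjusts the limit itself.

```python
import resource


def raise_open_file_limit(wanted: int = 4096) -> int:
    """Raise the soft open-file limit toward `wanted`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do

    # The soft limit can never exceed the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft


print("soft open-file limit:", raise_open_file_limit(4096))
```

Unlike `ulimit -n` in a shell, this changes the limit only for the current process and the workers it starts afterwards.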

Special Cases

CPU-only training:

  • Use 2-4 workers (lower than GPU training)
  • CPU is handling both training and data loading

Small datasets (< 1000 images):

  • Use 0-2 workers
  • Overhead of multiprocessing not worth it

Large images (> 2MB each):

  • Use 2-4 workers (lower than normal)
  • Each worker consumes more memory

Network/cloud storage:

  • Use 2-4 workers (lower than local disk)
  • Network I/O may be bottleneck
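
The special cases above can be folded into a starting-point heuristic. This is a sketch, not framework behavior: `suggest_num_workers` is hypothetical, it picks the low end of each recommended range, and the result should still be tuned by monitoring GPU utilization.

```python
import os


def suggest_num_workers(
    cpu_only: bool = False,
    small_dataset: bool = False,
    large_images: bool = False,
    network_storage: bool = False,
    default: int = 4,
) -> int:
    """Return a starting value for num_workers based on the special cases above."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    workers = default
    if cpu_only or large_images or network_storage:
        workers = min(workers, 2)  # low end of the recommended 2-4 range
    if small_dataset:
        workers = 0                # multiprocessing overhead is not worth it
    return min(workers, cores)     # never exceed the available cores


print(suggest_num_workers(network_storage=True))
```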


Complete Examples

Example 1: Local Dataset with CV

data:
  dataset_name: 'my_dataset'
  data_dir: 'data/my_dataset'  # Contains raw/ and splits/
  fold: 0
  num_workers: 4

# Generate splits first (one time)
ml-split --raw_data data/my_dataset/raw --output data/my_dataset/splits --folds 5

# Train fold 0
ml-train --fold 0

# Train fold 1
ml-train --fold 1

Example 2: Network Storage

data:
  dataset_name: 'imagenet'
  data_dir: '/mnt/nfs/shared_datasets/imagenet'  # Contains raw/ and splits/
  fold: 0
  num_workers: 2  # Lower due to network I/O

Example 3: Multiple Datasets

# config_hymenoptera.yaml
data:
  dataset_name: 'hymenoptera'
  data_dir: 'data/hymenoptera_data'
  fold: 0

# config_cifar.yaml
data:
  dataset_name: 'cifar10'
  data_dir: 'data/cifar10'
  fold: 0

# Train on different datasets
ml-train --config config_hymenoptera.yaml  # runs/hymenoptera_base/
ml-train --config config_cifar.yaml        # runs/cifar10_base/

Example 4: Debugging

data:
  dataset_name: 'test'
  data_dir: 'data/test_dataset'  # Contains raw/ and splits/
  fold: 0
  num_workers: 0  # Single-threaded for easier debugging

Best Practices

  1. Use descriptive dataset names - Makes run directories self-documenting
  2. Generate splits once - Reuse same splits for reproducibility
  3. Start with defaults (num_workers: 4)
  4. Monitor GPU utilization during training
  5. Commit splits to git - Index files are small and ensure reproducibility
  6. Document split generation - Save the exact splitting.py command used
  7. Keep raw/ immutable - Never modify after generating splits


Further Reading