Performance Tuning

Optimize training speed and memory usage.

Training Speed

Mixed Precision Training (2-3x Faster) ⭐

Mixed precision is the single most effective optimization: it typically gives a 2-3x speedup on GPUs with Tensor Cores.

training:
  trainer_type: 'mixed_precision'
  amp_dtype: 'float16'  # or 'bfloat16' on Ampere or newer GPUs (e.g. A100, RTX 30/40 series)

Benefits:
- 2-3x faster training
- ~50% memory reduction
- Minimal accuracy impact
- No code changes needed

Requirements:
- NVIDIA GPU with Tensor Cores (Volta, Turing, Ampere, or newer: V100, RTX 20/30/40 series, A100). Pascal cards such as the GTX 1080 Ti lack Tensor Cores and see little speedup.
- CUDA support

When to use:
- Single-GPU training (most common)
- You want maximum speed without multi-GPU complexity
- Recommended for all production training

See: Advanced Training Guide
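
Part of the choice between `float16` and `bfloat16` comes down to numeric range: `float16` cannot represent very small gradient values, which is why float16 training usually relies on loss scaling. The following is a minimal, standard-library-only sketch of that effect. The helper `to_float16` and the scale factor of 65536 are illustrative assumptions, not part of `ml-train`.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8   # below float16's smallest subnormal (about 6e-8)
scale = 65536.0    # illustrative loss-scale factor (a power of two)

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_float16(tiny_grad)

# Scaled before the float16 round trip, then unscaled in full precision,
# the value survives with only a small rounding error.
recovered = to_float16(tiny_grad * scale) / scale

print(unscaled)   # 0.0
print(recovered)
```

`bfloat16` keeps the exponent range of float32, so it rarely needs this kind of scaling, at the cost of fewer mantissa bits.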

Batch Size

# Larger batch = faster (if GPU memory allows)
ml-train --batch_size 64

Data Loading

# More workers = faster data loading, up to roughly the number of CPU cores
ml-train --num_workers 8
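
A reasonable starting value for `--num_workers` can be derived from the machine's core count. The sketch below is a hypothetical heuristic (available cores minus one for the main process, capped); the function name, the reserve, and the cap are assumptions to tune by measurement, not values taken from `ml-train`.

```python
import os


def suggest_num_workers(cpu_count=None, reserve=1, cap=16):
    """Hypothetical heuristic: available cores minus a reserve, capped."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(0, min(cap, cpu_count - reserve))


print(suggest_num_workers())
```

Treat the result as a first guess: if the GPU still waits on data, try more workers, and if host memory runs short, try fewer.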

Determinism

# Non-deterministic = faster
deterministic: false  # (default)

Image Size

# Smaller images = faster
transforms:
  train:
    resize: [128, 128]  # Instead of [224, 224]
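
To see why the smaller size helps, compare pixel counts: convolutional work grows roughly in proportion to the number of input pixels. This short sketch computes the ratio for the two sizes above; `pixel_ratio` is an illustrative helper, and real speedups will be smaller than the raw ratio because of fixed per-batch overheads.

```python
def pixel_ratio(new_size, old_size):
    """Return the fraction of pixels that remain after resizing."""
    new_h, new_w = new_size
    old_h, old_w = old_size
    return (new_h * new_w) / (old_h * old_w)


ratio = pixel_ratio((128, 128), (224, 224))
print(f"{ratio:.3f}")        # 0.327
print(f"{1 / ratio:.2f}x")   # 3.06x
```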

Memory Usage

Reduce Batch Size

ml-train --batch_size 8

Reduce Image Resolution

transforms:
  train:
    resize: [192, 192]
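
Batch size and resolution multiply together in the memory footprint. As a rough illustration, the sketch below computes the size of one float32 RGB input batch; the baseline of batch 32 at 224x224 is an assumed starting point, and `input_batch_bytes` is a hypothetical helper. Activations inside the model take far more memory than the input itself, but they scale with the same two factors.

```python
def input_batch_bytes(batch_size, height, width, channels=3, bytes_per_value=4):
    """Bytes needed for one input batch of float32 images."""
    return batch_size * channels * height * width * bytes_per_value


before = input_batch_bytes(32, 224, 224)  # assumed baseline
after = input_batch_bytes(8, 192, 192)    # settings from this section

print(f"before: {before / 2**20:.1f} MiB")
print(f"after:  {after / 2**20:.1f} MiB")
print(f"reduction: {before / after:.1f}x")
```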

Reduce Workers

# Fewer data-loading workers lowers host (CPU) memory use, not GPU memory
ml-train --num_workers 2

Smaller Model

model:
  architecture: 'mobilenet_v3_small'

GPU Optimization

Check Utilization

# Should be near 100%
watch -n 1 nvidia-smi
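
For logging utilization from a script instead of watching it interactively, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below parses that output; the function names and the sample values are illustrative, and the parser assumes every field is a plain integer (some GPUs report `[N/A]` for certain fields).

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def query_gpus():
    """Run nvidia-smi once; return [] when the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


# Parsing a sample line (made-up values, for illustration only).
sample = "97, 21034, 24576\n"
print(parse_gpu_stats(sample))
```

Calling `query_gpus()` in a loop during training gives a utilization trace you can compare across configurations.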

Maximize GPU Usage

1. Increase the batch size until you hit an out-of-memory (OOM) error.
2. Reduce it slightly.
3. Verify that utilization stays at 95% or higher.
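
The search for the largest workable batch size can be automated. The sketch below doubles the batch size until a step fails, then binary-searches between the last size that fit and the first that did not. `fits` is a hypothetical callable you would supply (for example, one that runs a single training step and returns False on an out-of-memory error); it is not an `ml-train` API.

```python
def find_max_batch_size(fits, start=8, limit=4096):
    """Double the batch size until `fits` fails, then binary-search the gap.

    `fits(batch_size)` is a hypothetical callable that runs one training
    step and returns False when it hits an out-of-memory error.
    """
    if not fits(start):
        raise ValueError("even the starting batch size does not fit")
    good, bad = start, None
    while bad is None and good < limit:
        candidate = min(good * 2, limit)
        if fits(candidate):
            good = candidate
        else:
            bad = candidate
    if bad is None:
        return good
    # Narrow down between the last size that fit and the first that failed.
    while bad - good > 1:
        mid = (good + bad) // 2
        if fits(mid):
            good = mid
        else:
            bad = mid
    return good


# Stand-in for a real training step: pretend anything above 100 runs out of memory.
print(find_max_batch_size(lambda b: b <= 100))  # 100
```

In practice, use a value somewhat below the result, since memory use can vary between steps.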

Multiple GPUs

Multi-GPU training is supported via Hugging Face Accelerate:

# One-time setup
uv pip install accelerate
accelerate config

# Launch multi-GPU training
accelerate launch ml-train --config configs/my_config.yaml

Configuration:

training:
  trainer_type: 'accelerate'
  batch_size: 32  # Per-device batch size

Benefits:
- ~3.5x faster with 2 GPUs when combined with mixed precision (data parallelism alone gives at most about 2x)
- Near-linear scaling with more GPUs
- Supports distributed training
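
Because `batch_size` is per device, the batch the optimizer effectively sees grows with the number of GPUs. This small sketch makes that arithmetic explicit and adds a helper for judging how well a measured speedup scales; both function names are illustrative, and the 1.8x speedup is a made-up example value.

```python
def effective_batch_size(per_device_batch_size, num_gpus):
    """Total samples processed per optimizer step across all devices."""
    return per_device_batch_size * num_gpus


def scaling_efficiency(speedup, num_gpus):
    """Measured speedup divided by the ideal (linear) speedup."""
    return speedup / num_gpus


print(effective_batch_size(32, 2))           # 64
print(f"{scaling_efficiency(1.8, 2):.0%}")   # 90%
```

If the effective batch size changes a lot, the learning rate may need retuning.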

See: Advanced Training Guide

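Disk I/O

Use SSD

SSDs are faster than HDDs for image loading, especially for the many small random reads a shuffled dataset produces.

Reduce Image Size

Smaller files load faster.

Cache Preprocessing

Caching preprocessed data helps when you repeat experiments on the same dataset. It requires code modification.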
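
One way to add such a cache is to wrap the preprocessing function so its results are written to disk and reused on later runs. This is a minimal sketch under stated assumptions: `preprocess` stands for whatever deterministic per-file function your pipeline uses (decode, resize), and `cached` is a hypothetical helper, not part of `ml-train`.

```python
import hashlib
import pickle
from pathlib import Path


def cached(preprocess, cache_dir):
    """Wrap `preprocess(path)` so results are stored on disk and reused."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def wrapper(path):
        # Key on the file path plus its modification time so edited files
        # are reprocessed instead of served stale from the cache.
        stat = Path(path).stat()
        key = hashlib.sha256(f"{path}:{stat.st_mtime_ns}".encode()).hexdigest()
        cache_file = cache_dir / f"{key}.pkl"
        if cache_file.exists():
            with cache_file.open("rb") as fh:
                return pickle.load(fh)
        result = preprocess(path)
        with cache_file.open("wb") as fh:
            pickle.dump(result, fh)
        return result

    return wrapper
```

Only cache deterministic steps. Random augmentations should still run on every epoch, after the cached result is loaded.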

Profiling

Time Bottlenecks

import time

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
# ... training code for one epoch ...
print(f"Epoch time: {time.perf_counter() - start:.2f}s")
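
A single epoch time does not show where the time goes. The sketch below extends the same idea by accumulating time per labeled phase, so data loading and compute can be compared. The `timed` helper and the labels are illustrative, and the `time.sleep` calls are stand-ins for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)


@contextmanager
def timed(label):
    """Accumulate wall-clock time spent inside the block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


# Stand-in loop: the sleeps represent data loading and a training step.
for _ in range(3):
    with timed("data"):
        time.sleep(0.01)
    with timed("compute"):
        time.sleep(0.02)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

On a GPU, kernels run asynchronously, so call `torch.cuda.synchronize()` before reading the clock if you need accurate compute timings.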

PyTorch Profiler

For operator-level profiling, use the PyTorch profiler (torch.profiler). It requires code modification.

Benchmarks

Example training speed (ResNet18, batch size 32):
- GPU (RTX 3090): ~0.5s/epoch
- GPU (GTX 1080): ~1.5s/epoch
- CPU: ~60s/epoch