Performance Tuning¶
Optimize training speed and memory usage.
Training Speed¶
Mixed Precision Training (2-3x Faster) ⭐¶
The single most effective optimization: mixed precision typically gives a 2-3x speedup on modern NVIDIA GPUs.
training:
  trainer_type: 'mixed_precision'
  amp_dtype: 'float16'  # or 'bfloat16' on Ampere or newer (A100, RTX 30/40 series)
Benefits:
- 2-3x faster training
- ~50% memory reduction
- Minimal accuracy impact
- No code changes needed
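Requirements:
- NVIDIA GPU with Tensor Cores (Volta, Turing, Ampere, or newer: V100, RTX 20/30/40 series, A100). Pascal cards such as the GTX 1080 Ti lack Tensor Cores and gain little from mixed precision.
- CUDA support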
When to use:
- Single GPU training (most common)
- You want maximum speed without multi-GPU complexity
- Recommended for all production training
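The choice between float16 and bfloat16 comes down to numeric range. The sketch below emulates both formats with the standard library only, to show why float16 training usually relies on loss scaling while bfloat16 usually does not. It is an illustration, not part of this project's trainer: the helper names to_float16 and to_bfloat16 and the scale factor 65536 are assumptions chosen for the example.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


tiny_grad = 1e-8   # a small gradient value, below float16's smallest subnormal
scale = 65536.0    # illustrative loss scale (assumption for this example)

# float16 has a narrow exponent range: the value underflows to zero.
print("float16:           ", to_float16(tiny_grad))  # → 0.0

# Scaling before the cast and unscaling afterwards preserves the value.
print("float16 + scaling: ", to_float16(tiny_grad * scale) / scale)

# bfloat16 keeps float32's exponent range, so the value survives with
# reduced precision and no scaling.
print("bfloat16:          ", to_bfloat16(tiny_grad))
```

In practice a mixed precision trainer handles this scaling internally; the point here is only that bfloat16 trades precision for range, which is why it is often the simpler choice on hardware that supports it.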
Batch Size¶
Data Loading¶
Determinism¶
Image Size¶
Memory Usage¶
Reduce Batch Size¶
Reduce Image Resolution¶
Reduce Workers¶
Smaller Model¶
GPU Optimization¶
Check Utilization¶
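One way to check utilization from Python is to query nvidia-smi and parse its CSV output. The sketch below is a minimal example under that assumption; the function names parse_gpu_stats and query_gpus are hypothetical, and it assumes every queried field is reported as a number (some GPUs report N/A for certain fields).

```python
import shutil
import subprocess
from typing import Dict, List

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text: str) -> List[Dict[str, int]]:
    """Parse 'nvidia-smi --format=csv,noheader,nounits' output, one GPU per line."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def query_gpus() -> List[Dict[str, int]]:
    """Return per-GPU stats, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(query_gpus()):
        print(
            f"GPU {index}: {gpu['util_pct']}% util, "
            f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB"
        )
```

Run it in a second terminal while training is in progress; a single reading can be misleading, so sample it a few times over an epoch.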
Maximize GPU Usage¶
- Increase the batch size until you hit an out-of-memory (OOM) error
- Then reduce it slightly
- Verify that GPU utilization stays at 95% or higher
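The search described above can be sketched as a small helper that doubles the batch size until a trial step fails, then backs off. This is a hedged sketch, not this project's implementation: find_batch_size, try_step, and the 0.9 safety factor are hypothetical, and MemoryError stands in for the GPU out-of-memory exception.

```python
from typing import Callable, Tuple, Type


def find_batch_size(
    try_step: Callable[[int], None],
    start: int = 8,
    limit: int = 4096,
    safety: float = 0.9,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> int:
    """Double the batch size until try_step raises an OOM error, then back off.

    try_step(batch_size) should run one forward/backward pass at that size.
    With PyTorch 1.13 or newer, pass oom_errors=(torch.cuda.OutOfMemoryError,).
    """
    largest_ok = 0
    size = start
    while size <= limit:
        try:
            try_step(size)
        except oom_errors:
            break  # this size does not fit; keep the previous one
        largest_ok = size
        size *= 2
    if largest_ok == 0:
        raise RuntimeError(f"Even batch size {start} does not fit in memory")
    # "Reduce slightly": leave headroom for fragmentation and larger batches.
    return max(1, int(largest_ok * safety))


def fake_step(batch_size: int) -> None:
    """Stand-in for a real training step: pretend anything above 100 runs out of memory."""
    if batch_size > 100:
        raise MemoryError


print(find_batch_size(fake_step))  # → 57 (64 was the largest size that fit)
```

Doubling keeps the number of trial steps small; a finer search between the last good size and the first failing one is possible but rarely worth the extra runs.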
Multiple GPUs¶
Multi-GPU training is supported via Hugging Face Accelerate:
# One-time setup
uv pip install accelerate
accelerate config
# Launch multi-GPU training
accelerate launch ml-train --config configs/my_config.yaml
Configuration:
Benefits:
- Close to linear speedup as GPUs are added (at most about 2x with 2 GPUs)
- Supports distributed training
Disk I/O¶
Use SSD¶
Store the dataset on an SSD. Reading many small image files is much faster from an SSD than from an HDD.
Reduce Image Size¶
Smaller files load faster.
Cache Preprocessing¶
Cache preprocessed data so repeated experiments do not redo the same work. This requires modifying the code.
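A minimal sketch of such a cache, using only the standard library: results are pickled to disk and keyed on the source file's path, size, and modification time. The function name cached_preprocess, the cache layout, and the version tag are assumptions for illustration, not part of this project.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


def cached_preprocess(
    source: Path,
    preprocess: Callable[[Path], Any],
    cache_dir: Path,
    version: str = "v1",
) -> Any:
    """Return preprocess(source), reusing a pickled result from cache_dir if present."""
    stat = source.stat()
    # Key on path, size, mtime, and a version tag, so that editing the file
    # or changing the preprocessing code (bump `version`) invalidates the entry.
    key_source = f"{source.resolve()}|{stat.st_size}|{stat.st_mtime_ns}|{version}"
    key = hashlib.sha256(key_source.encode()).hexdigest()
    cache_file = cache_dir / f"{key}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    result = preprocess(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp_file = cache_file.with_suffix(".tmp")
    with tmp_file.open("wb") as f:
        pickle.dump(result, f)
    tmp_file.replace(cache_file)  # rename last so readers never see a partial file
    return result
```

Only deterministic steps (decoding, resizing) belong in the cache. Random augmentations should still run on every epoch, after the cached result is loaded.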
Profiling¶
Time Bottlenecks¶
import time

start = time.perf_counter()  # monotonic, high-resolution clock for intervals
# ... training code for one epoch ...
print(f"Epoch time: {time.perf_counter() - start:.2f}s")
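A single epoch time does not show where the time goes. The sketch below splits it into time spent waiting for data and time spent in the training step, which helps decide whether to tune data loading or the model side. The names time_epoch, loader, and step are hypothetical placeholders for your own data loader and training step.

```python
import time
from typing import Callable, Dict, Iterable


def time_epoch(loader: Iterable, step: Callable[[object], None]) -> Dict[str, float]:
    """Run one pass over loader and report data-wait time vs. step time."""
    data_s = 0.0
    compute_s = 0.0
    batches = 0
    iterator = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, kernels run asynchronously: step should synchronize
        # (for example with torch.cuda.synchronize()) for accurate timing.
        step(batch)
        t2 = time.perf_counter()
        data_s += t1 - t0
        compute_s += t2 - t1
        batches += 1
    return {"batches": batches, "data_s": data_s, "compute_s": compute_s}


def slow_loader():
    """Stand-in loader that takes 10 ms to produce each batch."""
    for i in range(3):
        time.sleep(0.01)
        yield i


report = time_epoch(slow_loader(), step=lambda batch: None)
print(report)
```

If data_s dominates, the loader is the bottleneck and the Data Loading and Disk I/O settings are the place to look; if compute_s dominates, focus on mixed precision, batch size, or model size.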
PyTorch Profiler¶
For advanced profiling, use the PyTorch profiler (torch.profiler). This requires modifying the training code.
Benchmarks¶
Example training speed (ResNet18, batch size 32). Absolute times depend on dataset size and image resolution:
- GPU (RTX 3090): ~0.5s/epoch
- GPU (GTX 1080): ~1.5s/epoch
- CPU: ~60s/epoch