Troubleshooting Guide
Common issues and solutions.
Installation Issues
"torch not found"
CUDA Not Available
# Check CUDA
python -c "import torch; print(torch.cuda.is_available())"
# If False, train on CPU
ml-train --device cpu
Training Issues
Out of Memory
Error: RuntimeError: CUDA out of memory
Solutions:
# 1. Reduce batch size
ml-train --batch_size 8
# 2. Reduce image size (in config.yaml)
transforms:
  train:
    resize: [128, 128]
# 3. Use CPU
ml-train --device cpu
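To judge how much the first two options are likely to help, a rough rule of thumb is that activation memory grows in proportion to batch size × image height × image width. The sketch below applies that approximation; the baseline of batch size 32 at 224×224 is a hypothetical example, and model weights and optimizer state do not shrink with these settings.

```python
def relative_activation_memory(batch_size: int, height: int, width: int,
                               ref_batch_size: int, ref_height: int, ref_width: int) -> float:
    """Ratio of activation memory to a reference setting.

    Assumes memory is proportional to batch_size * height * width,
    which is only an approximation.
    """
    return (batch_size * height * width) / (ref_batch_size * ref_height * ref_width)


if __name__ == "__main__":
    # Hypothetical baseline: batch size 32 at 224x224.
    print(relative_activation_memory(8, 224, 224, 32, 224, 224))
    print(round(relative_activation_memory(32, 128, 128, 32, 224, 224), 3))
```

Under this approximation, cutting the batch size from 32 to 8 leaves about a quarter of the activation memory, and resizing from 224×224 to 128×128 leaves about a third.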
Loss is NaN
Symptom: Loss becomes NaN during training
Solutions:
1. Lower learning rate: ml-train --lr 0.0001
2. Check data normalization
3. Verify labels are correct
4. Use gradient clipping (code modification needed)
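For option 4, a minimal sketch of gradient-norm clipping in a plain PyTorch training step is shown below. The function `training_step` and the toy model are illustrative only; where this call belongs inside the ml-train training loop is not documented here, so treat it as a pattern to adapt.

```python
import torch
from torch import nn


def training_step(model, optimizer, loss_fn, inputs, targets, max_norm=1.0):
    """Run one optimization step with gradient-norm clipping.

    Returns the loss value and the gradient norm measured before clipping.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Rescale gradients in place so their total L2 norm is at most max_norm.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item(), float(grad_norm)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(4, 3)  # toy model, for illustration only
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    inputs, targets = torch.randn(8, 4), torch.randint(0, 3, (8,))
    loss, grad_norm = training_step(model, optimizer, nn.CrossEntropyLoss(), inputs, targets)
    print(f"loss={loss:.4f} grad_norm_before_clipping={grad_norm:.4f}")
```

Logging the returned norm shows how often clipping is active, which can help decide whether the learning rate is still too high.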
Training Too Slow
Solutions:
# 1. Increase workers
ml-train --num_workers 8
# 2. Larger batch size
ml-train --batch_size 64
# 3. Check GPU utilization
nvidia-smi
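A reasonable starting value for --num_workers depends on how many CPU cores the machine has. The sketch below is a heuristic, not a rule from ml-train: the `reserve` and `cap` parameters are assumptions chosen for illustration.

```python
import os


def suggest_num_workers(reserve: int = 1, cap: int = 16) -> int:
    """Suggest a worker count: CPU cores minus a reserve, kept within [1, cap]."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(cap, cores - reserve))


if __name__ == "__main__":
    print(f"ml-train --num_workers {suggest_num_workers()}")
```

If GPU utilization in nvidia-smi stays low after raising the worker count, data loading is probably not the only bottleneck.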
Low Accuracy
Solutions:
1. Train for more epochs
2. Use pretrained weights (weights: 'DEFAULT')
3. Increase model capacity
4. Check data augmentation
5. Verify dataset quality
Data Issues
"Found 0 files" or "Index file not found"
Error: RuntimeError: Found 0 files in raw/ or FileNotFoundError: Index file not found
Cause: Incorrect directory structure or missing split files
Solution: Organize data properly and generate splits:
data/my_dataset/
├── raw/                  # Your images organized by class
│   ├── class1/
│   │   ├── img1.jpg
│   │   └── ...
│   └── class2/
│       ├── img2.jpg
│       └── ...
└── splits/               # Generated by ml-split
    ├── test.txt
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    └── ...
Generate the splits with ml-split.
See: Data Preparation
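Before generating splits, it can help to confirm that raw/ really contains images in per-class folders. The sketch below counts image files per class using only the standard library. The helper name and the set of file extensions are assumptions; adjust them to your data.

```python
from pathlib import Path

# Assumption: extensions treated as images; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(dataset_root: str) -> dict[str, int]:
    """Map each class folder under <dataset_root>/raw to its image count."""
    raw_dir = Path(dataset_root) / "raw"
    if not raw_dir.is_dir():
        raise FileNotFoundError(f"Missing directory: {raw_dir}")
    counts = {}
    for class_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling `count_images_per_class("data/my_dataset")` returns a dictionary such as one entry per class folder; an empty dictionary or a class with a count of 0 points to the directory layout as the likely cause of "Found 0 files".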
Class Mismatch
Symptom: Implausible metrics or a confusion matrix whose classes do not line up
Cause: Different class names in train/val/test
Solution: Ensure identical class folder names across all splits
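One way to check this is to compare the class names that appear in each split file. The sketch below assumes that every line of a split file is an image path whose parent folder is the class name (for example class1/img1.jpg). The actual format written by ml-split is not shown in this guide, so verify that assumption first.

```python
from pathlib import Path


def classes_in_split(split_file: Path) -> set[str]:
    """Collect class names from a split file with one image path per line."""
    classes = set()
    for line in split_file.read_text().splitlines():
        line = line.strip()
        if line:
            # Assumption: the class is the name of the image's parent folder.
            classes.add(Path(line).parent.name)
    return classes


def report_class_mismatches(splits_dir: str) -> dict[str, set[str]]:
    """Return, per split file, the classes it lacks relative to all splits combined."""
    per_split = {
        f.name: classes_in_split(f) for f in sorted(Path(splits_dir).glob("*.txt"))
    }
    all_classes = set().union(*per_split.values())
    return {
        name: all_classes - found
        for name, found in per_split.items()
        if all_classes - found
    }
```

An empty result from `report_class_mismatches("data/my_dataset/splits")` means every split file references the same set of classes.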
Resumption Issues
Checkpoint Not Found
# Check that the checkpoint file exists
ls runs/base/weights/
# Resume from the correct path
ml-train --resume runs/base/weights/last.pt
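If it is unclear which run holds the checkpoint, the sketch below lists candidate files across all runs. It assumes the runs/<name>/weights/ layout and the .pt extension seen in the example path above; the helper name is illustrative.

```python
from pathlib import Path


def find_checkpoints(runs_dir: str = "runs") -> list[Path]:
    """Return all .pt files under <runs_dir>/*/weights, newest first."""
    candidates = Path(runs_dir).glob("*/weights/*.pt")
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for checkpoint in find_checkpoints():
        print(checkpoint)
```

Any printed path can be passed to ml-train --resume.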
Device Mismatch
Problem: A checkpoint trained on GPU cannot be resumed on a CPU-only machine
Solution: Load checkpoint explicitly to target device (code modification needed)
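In PyTorch this is usually done with the map_location argument of torch.load. The sketch below is a minimal illustration, not the ml-train implementation: the `load_checkpoint` helper and the checkpoint keys in the demo are hypothetical, and the real checkpoint layout may differ.

```python
import os
import tempfile

import torch


def load_checkpoint(path: str, device: str = "cpu"):
    """Load a checkpoint, remapping stored tensors onto `device`."""
    return torch.load(path, map_location=torch.device(device))


if __name__ == "__main__":
    # Demo with a throwaway checkpoint; the keys are illustrative only.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "last.pt")
        torch.save({"epoch": 3, "model": torch.nn.Linear(2, 1).state_dict()}, path)
        checkpoint = load_checkpoint(path, device="cpu")
        print(checkpoint["epoch"], checkpoint["model"]["weight"].device)
```

Without map_location, torch.load tries to restore tensors to the device they were saved from, which fails when no GPU is present.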
Configuration Issues
Override Not Working
Config File Not Found
TensorBoard Issues
Port Already in Use
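TensorBoard listens on port 6006 by default and accepts a --port option to use another one. The sketch below looks for a free local port with the standard library; the helper names and the search range are illustrative assumptions.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use.
        return sock.connect_ex((host, port)) != 0


def first_free_port(start: int = 6006, attempts: int = 20) -> int:
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}..{start + attempts - 1}")


if __name__ == "__main__":
    print(f"tensorboard --logdir <your-log-dir> --port {first_free_port()}")
```

The placeholder <your-log-dir> must be replaced with the directory that holds your run's logs.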
No Data Shown
Cause: TensorBoard is looking in the wrong directory
Solution: Point TensorBoard's --logdir option at the directory that contains the run's event files.
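TensorBoard event files have names starting with events.out.tfevents, so the right directory can be found by searching for them. The sketch below assumes logs live somewhere under runs/, as the other paths in this guide suggest; the helper name is illustrative.

```python
from pathlib import Path


def find_logdirs(root: str = "runs") -> list[Path]:
    """Return directories under `root` that contain TensorBoard event files."""
    event_files = Path(root).rglob("events.out.tfevents.*")
    return sorted({f.parent for f in event_files})


if __name__ == "__main__":
    for logdir in find_logdirs():
        print(f"tensorboard --logdir {logdir}")
```

If nothing is printed, no event files exist under runs/, which suggests training has not written any TensorBoard logs yet or writes them elsewhere.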
Quick Diagnostic Commands
# Check Python/PyTorch
python -c "import torch; print(torch.__version__)"
# Check NVIDIA driver and GPU status
nvidia-smi
# Check dataset structure
tree -L 3 data/my_dataset/
# Check GPU usage during training
watch -n 1 nvidia-smi
# View training log
cat runs/base/logs/train.log
# Check configuration
cat runs/base/config.yaml
Getting Help
- Read the error message carefully
- Review the relevant documentation section
- Check system info:
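A small report like the one sketched below can be pasted into a help request. It uses only the standard library, so it also works when torch is not installed. The list of packages is an assumption about what ml-train depends on; adjust it as needed.

```python
import platform
import shutil
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def system_info() -> dict[str, str]:
    """Collect basic environment details for a bug report."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "torch": package_version("torch"),
        "torchvision": package_version("torchvision"),  # assumed dependency
        "nvidia-smi": shutil.which("nvidia-smi") or "not on PATH",
    }


if __name__ == "__main__":
    for key, value in system_info().items():
        print(f"{key}: {value}")
```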