Troubleshooting Guide

Common issues and solutions.

Installation Issues

"torch not found"

# Install the package with dependencies
uv pip install -e .

CUDA Not Available

# Check CUDA
python -c "import torch; print(torch.cuda.is_available())"

# If False, train on CPU
ml-train --device cpu

Training Issues

Out of Memory

Error: RuntimeError: CUDA out of memory

Solutions:

# 1. Reduce batch size
ml-train --batch_size 8

# 2. Reduce image size (in config.yaml)
transforms:
  train:
    resize: [128, 128]

# 3. Use CPU
ml-train --device cpu

Loss is NaN

Symptom: Loss becomes NaN during training

Solutions:

  1. Lower learning rate: ml-train --lr 0.0001
  2. Check data normalization
  3. Verify labels are correct
  4. Use gradient clipping (code modification needed)
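The gradient clipping suggested in item 4 could look roughly like the sketch below. It assumes a standard PyTorch training step; the model, optimizer, batch, and the `max_norm=1.0` threshold are placeholders for illustration, not the project's actual training code.

```python
import torch
from torch import nn

# Placeholder model, optimizer, and batch; the project's own objects would go here.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))


def train_step(max_norm: float = 1.0) -> tuple[float, float]:
    """Run one optimization step with gradient-norm clipping.

    Returns the loss and the total gradient norm measured before clipping.
    """
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # Rescale gradients in place so their combined L2 norm is at most max_norm.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item(), float(total_norm)


loss_value, grad_norm = train_step()
print(f"loss={loss_value:.4f} grad_norm_before_clip={grad_norm:.4f}")
```

The clipping call sits between `loss.backward()` and `optimizer.step()`, so the optimizer only ever sees the rescaled gradients. Logging the pre-clip norm can help show whether exploding gradients are the source of the NaN.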

Training Too Slow

Solutions:

# 1. Increase workers
ml-train --num_workers 8

# 2. Larger batch size
ml-train --batch_size 64

# 3. Check GPU utilization
nvidia-smi

Low Accuracy

Solutions:

  1. Train more epochs
  2. Use pretrained weights: weights: 'DEFAULT'
  3. Increase model capacity
  4. Check data augmentation
  5. Verify dataset quality

Data Issues

"Found 0 files" or "Index file not found"

Error: RuntimeError: Found 0 files in raw/ or FileNotFoundError: Index file not found

Cause: Incorrect directory structure or missing split files

Solution: Organize data properly and generate splits:

data/my_dataset/
├── raw/               # Your images organized by class
│   ├── class1/
│   │   ├── img1.jpg
│   │   └── ...
│   └── class2/
│       ├── img2.jpg
│       └── ...
└── splits/            # Generated by ml-split
    ├── test.txt
    ├── fold_0_train.txt
    ├── fold_0_val.txt
    └── ...

Generate splits:

ml-split --raw_data data/my_dataset/raw --folds 5

See: Data Preparation
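Before regenerating splits, it may help to confirm that raw/ really contains one folder per class with images inside. The following is a minimal sketch using only the standard library; `count_images_per_class` and the extension list are hypothetical and are not part of the ml-split tool.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the data loader accepts.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(raw_dir: Path) -> dict[str, int]:
    """Return {class_folder_name: number_of_image_files} for raw_dir."""
    counts = {}
    for class_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        # Count only direct children that look like image files.
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images_per_class(Path("data/my_dataset/raw"))` returns an empty dict when raw/ has no class folders, and a zero count for any class folder that holds no recognized images. Either result would be consistent with the "Found 0 files" error.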

Class Mismatch

Symptom: Implausible metrics, or a confusion matrix whose rows and columns do not match the expected classes

Cause: Different class names in train/val/test

Solution: Ensure identical class folder names across all splits
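One way to check for a class mismatch is to compare the class names referenced by each split file. The sketch below assumes that every index file in splits/ lists one image path per line and that the class name is the image's parent folder (for example `class1/img1.jpg`). The real file format may differ, and both function names are hypothetical.

```python
from pathlib import Path, PurePosixPath


def classes_in_split(index_file: Path) -> set[str]:
    """Collect class names from one index file.

    Assumes one image path per line, with the class name as the
    image's parent folder (e.g. class1/img1.jpg).
    """
    classes = set()
    for line in index_file.read_text().splitlines():
        line = line.strip()
        if line:
            classes.add(PurePosixPath(line).parent.name)
    return classes


def report_class_mismatch(splits_dir: Path) -> dict[str, set[str]]:
    """Map each split file name to the classes it lacks.

    A class counts as missing if it appears in at least one other split file.
    Split files that contain every class are left out of the result.
    """
    per_split = {f.name: classes_in_split(f) for f in sorted(splits_dir.glob("*.txt"))}
    all_classes = set().union(*per_split.values())
    return {
        name: all_classes - found
        for name, found in per_split.items()
        if all_classes - found
    }
```

An empty result from `report_class_mismatch(Path("data/my_dataset/splits"))` means every split file references the same set of classes under these assumptions.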

Resumption Issues

Checkpoint Not Found

# Check file exists
ls runs/base/weights/

# Correct path
ml-train --resume runs/base/weights/last.pt

Device Mismatch

Problem: A checkpoint saved during GPU training fails to load when resuming on a CPU-only machine

Solution: Load checkpoint explicitly to target device (code modification needed)
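Loading a checkpoint onto an explicit target device is usually done with the `map_location` argument of `torch.load`. The sketch below writes a small stand-in checkpoint and loads it back. The `"model"` and `"epoch"` keys and the file name are assumptions for illustration, since the layout of the project's last.pt is not documented here.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

# Pick the device that is actually available on this machine.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a checkpoint written by a training run (assumed layout).
    checkpoint_path = Path(tmp) / "example_last.pt"
    torch.save({"model": model.state_dict(), "epoch": 3}, checkpoint_path)

    # map_location remaps every stored tensor onto `device`, so a checkpoint
    # saved on a GPU can be opened on a CPU-only machine.
    checkpoint = torch.load(checkpoint_path, map_location=device)

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model"])
restored.to(device)
print(f"resumed from epoch {checkpoint['epoch']} on {device}")
```

If the training code calls `torch.load` without `map_location`, adding that argument at the point where the checkpoint is read is likely the smallest change needed.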

Configuration Issues

Override Not Working

# Wrong argument name
ml-train --learning_rate 0.01  # ❌

# Correct
ml-train --lr 0.01  # ✅

Config File Not Found

# Use absolute path
ml-train --config /full/path/to/config.yaml

TensorBoard Issues

Port Already in Use

# Use different port
tensorboard --logdir runs/ --port 6007

No Data Shown

Cause: TensorBoard looking in wrong directory

Solution:

# Point to correct directory
tensorboard --logdir runs/base/tensorboard

Quick Diagnostic Commands

# Check Python and PyTorch versions
python -c "import sys, torch; print(sys.version); print(torch.__version__)"

# Check GPU, driver, and supported CUDA version
nvidia-smi

# Check dataset structure
tree -L 3 data/my_dataset/

# Check GPU usage during training
watch -n 1 nvidia-smi

# View training log
cat runs/base/logs/train.log

# Check configuration
cat runs/base/config.yaml

Getting Help

  1. Read the error message carefully
  2. Review the relevant documentation section
  3. Collect system info:
    python -c "
    import sys, torch
    print(f'Python: {sys.version}')
    print(f'PyTorch: {torch.__version__}')
    print(f'CUDA: {torch.cuda.is_available()}')
    "