Frequently Asked Questions (FAQ)

Common questions and quick answers.


General

Q: What models are supported?

A: All torchvision models (ResNet, EfficientNet, ViT, etc.) are automatically supported. You can also create custom models.

See: Model Configuration


Q: Can I use my own dataset?

A: Yes! Just organize it in the required train/val/test structure with class subfolders.

See: Data Preparation


Q: Do I need a GPU?

A: No, but one is highly recommended. You can train on CPU with --device cpu, but training will be much slower.


Q: How do I resume interrupted training?

A: Use the --resume flag:

ml-train --resume runs/base/last.pt

See: Resuming Training


Configuration

Q: How do I change the learning rate?

A: Use a CLI override or edit the config:

ml-train --lr 0.01

Or in ml_src/config_template.yaml:

optimizer:
  lr: 0.01


Q: What's the difference between best.pt and last.pt?

A:
- best.pt: Model with the highest validation accuracy (use for deployment/inference)
- last.pt: Latest epoch checkpoint (use to resume training)


Q: Can I use pretrained weights?

A: Yes, set weights: 'DEFAULT' in config:

model:
  type: 'base'
  architecture: 'resnet18'
  weights: 'DEFAULT'  # ImageNet pretrained


Q: How do I change batch size?

A: Use the --batch_size flag:

ml-train --batch_size 32

Start small (8-16) and increase until you approach the GPU memory limit. If you hit a "CUDA out of memory" error, step back down.


Data & Cross-Validation

Q: What data structure is required?

A: The framework uses an index-based structure where raw images are stored once and referenced by text files for each split.

Mandatory structure:

data/your_dataset/
├── raw/
│   ├── class1/
│   └── class2/
└── splits/
    ├── test.txt
    ├── fold_0_train.txt
    └── fold_0_val.txt

You must run ml-split to generate the splits/ directory. See the Data Preparation Guide for details.
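
As a rough illustration of the index-based layout, the sketch below reads one split file and derives each label from the class folder under raw/. The file format is an assumption (one image path per line, relative to the dataset root, such as raw/class1/img_001.jpg), and read_split is a hypothetical helper, not part of the framework. Check the files that ml-split actually writes before relying on it.

```python
from pathlib import Path


def read_split(dataset_root, split_name):
    """Return (image_path, class_name) pairs listed in splits/<split_name>.txt.

    Assumed format: one path per line, relative to dataset_root,
    e.g. "raw/class1/img_001.jpg". Blank lines are skipped.
    """
    root = Path(dataset_root)
    index_file = root / "splits" / f"{split_name}.txt"
    samples = []
    for line in index_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        image_path = root / line
        # The class label is the name of the folder that holds the image.
        samples.append((image_path, image_path.parent.name))
    return samples
```

Under that assumption, read_split("data/your_dataset", "fold_0_train") would return the training samples for fold 0 without copying any images.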


Q: How does cross-validation work with test sets?

A: The test set is held out once and is the same for all folds:
- ✅ Test set: 15% held out once (test.txt), the same for all folds
- ✅ Train/Val: the remaining 85% is split across k folds (varies per fold)
- ✅ Fair comparison: all folds are evaluated on an identical test set

Why not different test sets per fold? Different test sets would make fold results incomparable. Our approach ensures consistent evaluation.
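
The idea can be sketched in a few lines of Python. This is only an illustration of "hold out once, then fold the rest", not the ml-split implementation: make_splits is a hypothetical name, the split is not stratified by class, and items are assumed to be unique and hashable (for example, file names).

```python
import random


def make_splits(items, k=5, test_fraction=0.15, seed=42):
    """Hold out one test set, then build k train/val folds from the rest."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]        # held out once, shared by every fold
    remaining = shuffled[n_test:]   # the other ~85%

    folds = []
    for i in range(k):
        val = remaining[i::k]       # every k-th item, offset by the fold index
        val_set = set(val)
        train = [x for x in remaining if x not in val_set]
        folds.append({"train": train, "val": val})
    return test, folds


test, folds = make_splits([f"img_{i:03d}.jpg" for i in range(100)], k=5)
print(len(test), [len(f["val"]) for f in folds])
```

With 100 items and k=5 this prints 15 [17, 17, 17, 17, 17]: the same 15 test items for every fold, and each of the remaining 85 items used for validation in exactly one fold.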


Q: Where is my test data after running ml-split?

A: Check data/my_dataset/splits/test.txt (single file, no fold number).
- ✅ test.txt: same for all folds
- ✅ fold_0_train.txt, fold_0_val.txt: vary per fold
- ✅ fold_1_train.txt, fold_1_val.txt: vary per fold


Training

Q: How many epochs should I train for?

A:
- Quick test: 3-5 epochs
- Small datasets: 25-50 epochs
- Large datasets: 50-100 epochs
- From scratch: 200+ epochs

Monitor validation curves and stop when they plateau.


Q: Training is very slow. What can I do?

A:
1. Increase batch size: ml-train --batch_size 64
2. Increase workers: ml-train --num_workers 8
3. Use smaller images (edit transforms in config)
4. Use a faster model: efficientnet_b0 or mobilenet_v3_small

See: Performance Tuning


Q: I'm getting "CUDA out of memory" errors. Help?

A:
1. Reduce batch size: ml-train --batch_size 8
2. Reduce image size (edit config transforms)
3. Use a smaller model
4. Train on CPU: ml-train --device cpu

See: Troubleshooting


Q: How do I know if my model is training well?

A:
- Training loss should decrease
- Validation accuracy should increase
- The gap between training and validation metrics should stay small (a large gap indicates overfitting)
- Check TensorBoard: tensorboard --logdir runs/


Models

Q: Which model should I use?

A:
- Beginner: resnet18 with weights: 'DEFAULT'
- Best accuracy: efficientnet_b7 or vit_b_16
- Fast training: efficientnet_b0 or mobilenet_v3_small
- Mobile/Edge: mobilenet_v3_small

See: Model Configuration


Q: Can I create my own model architecture?

A: Yes! Add your model to ml_src/network/custom.py and register it.

See: Adding Custom Models


Q: How do I switch models?

A: Edit config (generated by ml-init-config):

model:
  architecture: 'efficientnet_b0'  # Change this

Note: Checkpoints saved with the previous architecture cannot be loaded into the new one, so training must start from a fresh run.


Inference

Q: How do I run inference on test data?

A: Test evaluation is automatic after training! Results are saved to:
- runs/{run_name}/logs/classification_report_test.txt
- TensorBoard (confusion matrix, metrics)

For manual inference:

ml-inference runs/base/weights/best.pt

See: Inference Guide


Q: Can I test on a single image?

A: Not directly supported. You'd need to modify the inference module or create a new script.


Monitoring

Q: How do I view training metrics?

A: Use TensorBoard:

tensorboard --logdir runs/
# Open http://localhost:6006

See: Monitoring Guide


Q: What metrics are tracked?

A:
- Training/validation loss
- Training/validation accuracy
- Learning rate schedule
- Confusion matrices
- Per-class precision/recall/F1


Q: Where are the logs?

A:
- Training log: runs/{run_name}/logs/train.log
- Summary: runs/{run_name}/summary.txt
- TensorBoard: runs/{run_name}/tensorboard/


Hyperparameter Tuning

Q: How do I tune hyperparameters?

A: Use the ml-search command for automated hyperparameter optimization:

# Install Optuna support
uv pip install -e ".[optuna]"

# Generate config with search space
ml-init-config data/my_dataset --optuna

# Run optimization
ml-search --config configs/my_dataset_config.yaml --n-trials 50

# Train with best hyperparameters
ml-train --config runs/optuna_studies/my_study/best_config.yaml

Or run manual experiments:

ml-train --lr 0.001 --batch_size 16
ml-train --lr 0.01 --batch_size 16
ml-train --lr 0.01 --batch_size 32
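
For more than a handful of manual runs, the command lines can be generated instead of typed. The sketch below builds one ml-train command for every combination of the two flags shown above (--lr and --batch_size); build_commands is a hypothetical helper and the value lists are only examples.

```python
import itertools
import shlex

learning_rates = [0.001, 0.01]
batch_sizes = [16, 32]


def build_commands(lrs, batch_sizes):
    """Return one ml-train command line per (lr, batch_size) combination."""
    commands = []
    for lr, bs in itertools.product(lrs, batch_sizes):
        args = ["ml-train", "--lr", str(lr), "--batch_size", str(bs)]
        commands.append(shlex.join(args))
    return commands


for command in build_commands(learning_rates, batch_sizes):
    print(command)
```

To launch the runs rather than print them, each command could be passed to subprocess.run(shlex.split(command), check=True).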

See: Workflow Guide


Q: What hyperparameters should I tune first?

A: Priority order:
1. Learning rate (--lr)
2. Batch size (--batch_size)
3. Number of epochs (--num_epochs)
4. Scheduler settings (--step_size, --gamma)

For automated search, use ml-search which will optimize multiple parameters simultaneously.


Reproducibility

Q: How do I make results reproducible?

A: Set a fixed seed and use deterministic mode:

seed: 42
deterministic: true

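Note: deterministic: true is slower, but it makes runs repeatable on the same hardware and software versions. Results can still differ across GPUs, drivers, or PyTorch releases.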

See: Reproducibility Configuration


Q: Why do I get slightly different results each run?

A: With deterministic: false (default), some operations are non-deterministic for speed. Set deterministic: true for exact reproduction.
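
For readers wondering what these two settings usually translate to, here is a minimal sketch of a typical seeding routine. It is an assumption about common PyTorch practice, not the framework's actual code; seed_everything is a hypothetical name, and NumPy and PyTorch are only seeded if they are installed.

```python
import random


def seed_everything(seed=42, deterministic=False):
    """Seed the common random number generators; NumPy and PyTorch are optional."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    if deterministic:
        # Trade speed for repeatability in cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
```

Calling seed_everything(42, deterministic=True) before building the model and data loaders mirrors the seed: 42 and deterministic: true settings above.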


Errors

Q: "Found 0 files in subfolders" error?

A: Incorrect data structure. Images must be inside class subfolders.
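
A quick way to confirm the layout is to count the images in each class folder. The sketch below is a hypothetical diagnostic, not part of the framework, and the list of image extensions is an assumption; adjust it to your data.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed set of extensions


def count_images_per_class(raw_dir):
    """Map each class subfolder of raw_dir to the number of image files in it."""
    counts = {}
    for class_dir in sorted(Path(raw_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.rglob("*")
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

For example, count_images_per_class("data/my_dataset/raw") should list every class with a non-zero count. An empty result, or a class with 0 images, points to files sitting outside the class subfolders.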

See: Data Preparation


Q: "RuntimeError: CUDA out of memory"?

A: Reduce batch size:

ml-train --batch_size 8

See: Troubleshooting


Q: Loss becomes NaN during training?

A:
1. Lower the learning rate: ml-train --lr 0.0001
2. Check data normalization
3. Verify labels are correct


Advanced

Q: Can I use mixed precision training?

A: Yes! Use the mixed_precision trainer type in your config:

training:
  trainer_type: 'mixed_precision'
  amp_dtype: 'float16'

See: Advanced Training Guide


Q: Can I train on multiple GPUs?

A: Yes! Use the accelerate trainer type:

uv pip install accelerate
accelerate config
accelerate launch ml-train --config configs/my_config.yaml

See: Advanced Training Guide


Q: How do I add data augmentation?

A: Modify ml_src/dataset.py::get_transforms() to add more transforms.

See: Adding Transforms


Q: Can I use a different optimizer (Adam, AdamW)?

A: Yes, modify ml_src/optimizer.py::get_optimizer() to add more optimizers.

See: Adding Optimizers


Q: How do I implement early stopping?

A: Early stopping is built-in! Configure it in your config file:

training:
  early_stopping:
    enabled: true
    patience: 10
    metric: 'val_acc'
    mode: 'max'

Training will automatically stop if validation accuracy doesn't improve for 10 epochs.
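
The patience logic behind these settings can be sketched as follows. This is a generic illustration of early stopping, not the framework's implementation; the EarlyStopping class and its step method are hypothetical names, with patience and mode mirroring the config keys above.

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=10, mode="max"):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.patience = patience
        self.mode = mode
        self.best = None
        self.epochs_without_improvement = 0

    def step(self, value):
        """Record one epoch's metric; return True when training should stop."""
        improved = (
            self.best is None
            or (self.mode == "max" and value > self.best)
            or (self.mode == "min" and value < self.best)
        )
        if improved:
            self.best = value
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


stopper = EarlyStopping(patience=3, mode="max")
val_acc_history = [0.70, 0.75, 0.74, 0.75, 0.73, 0.80]
for epoch, val_acc in enumerate(val_acc_history):
    if stopper.step(val_acc):
        print(f"stopping after epoch {epoch}")
        break
```

With patience=3 the loop stops after epoch 4: the best value (0.75) is reached at epoch 1, and the next three epochs fail to beat it, so the later 0.80 is never seen. A larger patience avoids stopping this early at the cost of longer runs.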


Getting Help

Q: Where can I find more documentation?

A:
- Documentation Index
- Configuration Reference
- Troubleshooting Guide


Q: Something isn't working. What should I check?

A:
1. Verify data structure: tree -L 2 data/my_dataset/
2. Check configuration: cat runs/{run_name}/config.yaml
3. Review logs: cat runs/{run_name}/logs/train.log
4. Check system: python -c "import torch; print(torch.__version__)"

See: Troubleshooting


Still have questions? Check the full documentation at docs/README.md