Reference Documentation

Quick references, troubleshooting guides, and optimization resources for the PyTorch Image Classification framework.

Overview

This section collects reference material for quick lookups during development and training. Whether you're debugging an issue, optimizing performance, or checking best practices, these guides give concise, actionable answers.


Reference Documents

Best Practices

Essential tips and conventions for effective framework usage

Learn recommended approaches for configuration management, training workflows, hyperparameter tuning, data handling, and reproducibility. This guide helps you avoid common pitfalls and establish good habits from the start.

Key Topics:

  • Configuration management and versioning
  • Training workflow recommendations
  • Systematic hyperparameter tuning
  • Data verification and augmentation strategies
  • Code extension patterns
  • Reproducibility guidelines

When to use: Before starting new experiments, when establishing team conventions, or when unsure about recommended approaches.


Troubleshooting

Common issues and their solutions

Comprehensive troubleshooting guide covering installation problems, training issues, data errors, configuration mistakes, and inference problems. Each issue includes symptoms, causes, and step-by-step solutions.

Key Topics:

  • Installation and dependency issues
  • CUDA and GPU problems
  • Out of memory errors
  • Training failures (NaN loss, slow training, poor convergence)
  • Data loading errors
  • Configuration validation errors
  • Checkpoint and resume issues

When to use: When encountering errors, unexpected behavior, or performance issues. Check here before deep debugging.


Performance Tuning

Speed and memory optimization strategies

Detailed guide for optimizing training speed and reducing memory usage. Learn how to maximize GPU utilization, accelerate data loading, and train larger models within memory constraints.

Key Topics:

  • Training speed optimization (batch size, data loading, determinism)
  • Memory usage reduction techniques
  • GPU utilization monitoring
  • Model-specific optimizations
  • Profiling and bottleneck identification
  • Mixed precision training considerations

When to use: When training is too slow, when you are hitting memory limits, or when optimizing resource utilization for production workflows.


FAQ

Frequently asked questions with quick answers

Quick answers to common questions organized by category. Includes general framework questions, configuration help, training workflows, data handling, model selection, and deployment topics.

Key Topics:

  • Supported models and architectures
  • Dataset requirements and organization
  • GPU vs CPU training
  • Resuming interrupted training
  • Configuration overrides
  • Checkpoint selection (best vs last)
  • Custom model integration
  • Multi-GPU training

When to use: For quick answers to common questions without reading full documentation sections.


Visualization

TensorBoard tools and ml-visualise command reference

Complete reference for the ml-visualise CLI command and TensorBoard visualization capabilities. Learn how to visualize datasets, inspect predictions, monitor training metrics, and manage TensorBoard servers.

Key Topics:

  • ml-visualise CLI command modes and options
  • Visualizing dataset samples
  • Inspecting model predictions
  • TensorBoard server management
  • Log cleanup and organization
  • Training metrics visualization
  • Comparative experiment analysis

When to use: When setting up visualization, debugging data pipelines, analyzing model predictions, or comparing experiments.


Quick Access Guide

Most Commonly Needed Resources

Starting a new project?
  • Read: Best Practices → Configuration and Training sections

Encountering an error?
  • Check: Troubleshooting → Find your error message or symptom

Training too slow or out of memory?
  • Optimize: Performance Tuning → Speed or Memory sections

Quick question?
  • Search: FAQ → Organized by topic

Setting up visualization?
  • Reference: Visualization → ml-visualise modes


Common Scenarios

Scenario: First Time Training

  1. Review Best Practices - Configuration & Training
  2. Set up Visualization - Launch TensorBoard
  3. Keep Troubleshooting handy for any issues

Scenario: Optimizing Production Workflow

  1. Read Performance Tuning - All sections
  2. Apply Best Practices - Reproducibility
  3. Check FAQ - Multi-GPU and deployment topics

Scenario: Debugging Training Issues

  1. Check Troubleshooting - Training Issues section
  2. Verify configuration using FAQ
  3. Inspect data with Visualization - Samples mode
  4. Review Best Practices - Data section

Scenario: Team Onboarding

  1. Share Best Practices for conventions
  2. Bookmark FAQ for quick answers
  3. Reference Troubleshooting for common issues
  4. Demo Visualization tools

| Document | Primary Use |
|---|---|
| Best Practices | Conventions & recommendations |
| Troubleshooting | Error resolution |
| Performance Tuning | Speed & memory optimization |
| FAQ | Quick answers |
| Visualization | TensorBoard & ml-visualise |

Integration with Other Documentation

For comprehensive guides:

  • User Guides - Complete workflows and how-tos
  • Configuration Reference - All configuration options

For system understanding:

  • Architecture - System design and code structure
  • Development - Extending the framework

For getting started:

  • Getting Started - Installation and quick start


Documentation Tips

Effective Reference Usage

  1. Bookmark frequently used sections - Keep quick access to relevant guides
  2. Use browser search (Ctrl+F / Cmd+F) - Find specific topics within documents
  3. Check FAQ first - Often the fastest path to answers
  4. Cross-reference - Troubleshooting often links to Performance Tuning and Best Practices
  5. Stay updated - Reference docs evolve with common user needs

When Reference Isn't Enough

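If these quick references don't address your needs, turn to the full guides listed under Integration with Other Documentation above, such as the User Guides and the Configuration Reference.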


Quick Troubleshooting Checklist

Before deep debugging, verify:

  • Data directory structure is correct (see Data Preparation)
  • Configuration file is valid YAML
  • GPU is available and utilized (nvidia-smi)
  • Dependencies are installed (uv pip install -e .)
  • Sufficient disk space for checkpoints and logs
  • Correct Python and PyTorch versions

See Troubleshooting Guide for detailed solutions.
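
Several items on the checklist above can be checked from a small script. The following is a minimal sketch, not part of the framework: the function name `preflight`, its parameters, and the 5 GiB free-space default are assumptions made for illustration. It covers the YAML, `nvidia-smi`, disk space, and version items, and leaves the data directory check to the Data Preparation guide because the expected layout is defined there.

```python
import shutil
import sys
from pathlib import Path


def preflight(config_path, output_dir=".", min_free_gb=5.0):
    """Run a few checklist items and return {check_name: (status, detail)}.

    status is True (pass), False (fail), or None (could not be checked).
    The name, arguments, and thresholds here are illustrative assumptions.
    """
    results = {}

    # Configuration file is valid YAML (needs PyYAML; skipped if missing).
    try:
        import yaml
    except ImportError:
        results["config_yaml"] = (None, "PyYAML not installed; skipped")
    else:
        try:
            yaml.safe_load(Path(config_path).read_text(encoding="utf-8"))
            results["config_yaml"] = (True, str(config_path))
        except (OSError, yaml.YAMLError) as exc:
            results["config_yaml"] = (False, str(exc))

    # GPU tooling: is nvidia-smi on PATH at all?
    smi = shutil.which("nvidia-smi")
    results["nvidia_smi"] = (smi is not None, smi or "not found on PATH")

    # Sufficient disk space for checkpoints and logs.
    free_gb = shutil.disk_usage(output_dir).free / 1024**3
    results["disk_space"] = (free_gb >= min_free_gb, f"{free_gb:.1f} GiB free")

    # Python and PyTorch versions are reported, not judged.
    results["python"] = (True, sys.version.split()[0])
    try:
        import torch
    except ImportError:
        results["torch"] = (False, "torch is not importable")
    else:
        detail = f"{torch.__version__}, CUDA available: {torch.cuda.is_available()}"
        results["torch"] = (True, detail)

    return results
```

Calling `preflight("config.yaml")` (a hypothetical config path) and printing each entry gives a quick pass/fail summary before a long training run. A `None` status means the check was skipped, not that it passed.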


Performance Quick Wins

Common optimizations with immediate impact:

  1. Increase batch size - If GPU memory allows (e.g. --batch_size 64)
  2. Use more data workers - Speeds up data loading (e.g. --num_workers 8)
  3. Reduce image size - If resolution isn't critical (modify transforms)
  4. Disable determinism - Faster but non-reproducible (determinism is usually disabled by default)
  5. Monitor GPU usage - Aim for near 100% utilization during training

See Performance Tuning for comprehensive strategies.
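
For the GPU monitoring step, one option is to poll `nvidia-smi` in its CSV query mode and flag GPUs that sit well below full utilization. The sketch below is an illustration, not a framework utility: the helper names and the 90% threshold are assumptions. It assumes `nvidia-smi` reports numeric values, so a field such as `[N/A]` on unsupported hardware would raise a `ValueError`.

```python
import subprocess

# Standard nvidia-smi query mode: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into a list of per-GPU dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def underused_gpus(stats, threshold_pct=90):
    """Return indices of GPUs below the (assumed) utilization threshold."""
    return [gpu["index"] for gpu in stats if gpu["util_pct"] < threshold_pct]


def query_gpus():
    """Run nvidia-smi once and return parsed stats (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Running `underused_gpus(query_gpus())` a few times during training gives a rough signal. Persistently low utilization often points at the data loader, which is where the batch size and worker count settings above come in.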


Contributing to Reference Documentation

Found a common issue not covered? Have optimization tips to share?

Reference documentation grows from user experience:

  1. Document solutions to problems you encountered
  2. Share optimization strategies that worked
  3. Suggest FAQ additions for repeated questions
  4. Clarify confusing sections

Reference docs should be concise, actionable, and regularly used.


Explore other documentation sections:

  • Getting Started - New user guides
  • Configuration - Complete config reference
  • User Guides - Practical workflows
  • Architecture - System design
  • Development - Extending the framework


Need help fast? Start with FAQ or Troubleshooting