Reference Documentation¶

Quick references, troubleshooting guides, and optimization resources for the PyTorch Image Classification framework.

Overview¶

This section collects reference material for quick lookups during development and training. Whether you're debugging an issue, optimizing performance, or checking recommended practice, each guide is written to be concise and actionable.

Reference Documents¶

Best Practices¶

Essential tips and conventions for effective framework usage

Learn recommended approaches for configuration management, training workflows, hyperparameter tuning, data handling, and reproducibility. This guide helps you avoid common pitfalls and establish good habits from the start.

Key Topics:
- Configuration management and versioning
- Training workflow recommendations
- Systematic hyperparameter tuning
- Data verification and augmentation strategies
- Code extension patterns
- Reproducibility guidelines

When to use: Before starting new experiments, when establishing team conventions, or when unsure about recommended approaches.

Troubleshooting¶

Common issues and their solutions

Comprehensive troubleshooting guide covering installation problems, training issues, data errors, configuration mistakes, and inference problems. Each issue includes symptoms, causes, and step-by-step solutions.

Key Topics:
- Installation and dependency issues
- CUDA and GPU problems
- Out of memory errors
- Training failures (NaN loss, slow training, poor convergence)
- Data loading errors
- Configuration validation errors
- Checkpoint and resume issues

When to use: When encountering errors, unexpected behavior, or performance issues. Check here first before deep debugging.

Performance Tuning¶

Speed and memory optimization strategies

Detailed guide for optimizing training speed and reducing memory usage. Learn how to maximize GPU utilization, accelerate data loading, and train larger models within memory constraints.

Key Topics:
- Training speed optimization (batch size, data loading, determinism)
- Memory usage reduction techniques
- GPU utilization monitoring
- Model-specific optimizations
- Profiling and bottleneck identification
- Mixed precision training considerations

When to use: When training is too slow or hits memory limits, or when optimizing resource utilization for production workflows.

FAQ¶

Frequently asked questions with quick answers

Quick answers to common questions organized by category. Includes general framework questions, configuration help, training workflows, data handling, model selection, and deployment topics.

Key Topics:
- Supported models and architectures
- Dataset requirements and organization
- GPU vs CPU training
- Resuming interrupted training
- Configuration overrides
- Checkpoint selection (best vs last)
- Custom model integration
- Multi-GPU training

When to use: For quick answers to common questions without reading full documentation sections.

Visualization¶

TensorBoard tools and ml-visualise command reference

Complete reference for the ml-visualise CLI command and TensorBoard visualization capabilities. Learn how to visualize datasets, inspect predictions, monitor training metrics, and manage TensorBoard servers.

Key Topics:
- ml-visualise CLI command modes and options
- Visualizing dataset samples
- Inspecting model predictions
- TensorBoard server management
- Log cleanup and organization
- Training metrics visualization
- Comparative experiment analysis

When to use: When setting up visualization, debugging data pipelines, analyzing model predictions, or comparing experiments.

Quick Access Guide¶

Most Commonly Needed Resources¶

Starting a new project?
- Read: Best Practices → Configuration and Training sections

Encountering an error?
- Check: Troubleshooting → Find your error message or symptom

Training too slow or out of memory?
- Optimize: Performance Tuning → Speed or Memory sections

Quick question?
- Search: FAQ → Organized by topic

Setting up visualization?
- Reference: Visualization → ml-visualise modes

Common Scenarios¶

Scenario: First Time Training¶

Review Best Practices - Configuration & Training
Set up Visualization - Launch TensorBoard
Keep Troubleshooting handy for any issues

Scenario: Optimizing Production Workflow¶

Read Performance Tuning - All sections
Apply Best Practices - Reproducibility
Check FAQ - Multi-GPU and deployment topics

Scenario: Debugging Training Issues¶

Check Troubleshooting - Training Issues section
Verify configuration using FAQ
Inspect data with Visualization - Samples mode
Review Best Practices - Data section

Scenario: Team Onboarding¶

Share Best Practices for conventions
Bookmark FAQ for quick answers
Reference Troubleshooting for common issues
Demo Visualization tools

Reference Quick Links¶

Document	Primary Use	Quick Jump
Best Practices	Conventions & recommendations	View →
Troubleshooting	Error resolution	View →
Performance Tuning	Speed & memory optimization	View →
FAQ	Quick answers	View →
Visualization	TensorBoard & ml-visualise	View →

Integration with Other Documentation¶

For comprehensive guides:
- User Guides - Complete workflows and how-tos
- Configuration Reference - All configuration options

For system understanding:
- Architecture - System design and code structure
- Development - Extending the framework

For getting started:
- Getting Started - Installation and quick start

Documentation Tips¶

Effective Reference Usage¶

Bookmark frequently used sections - Keep quick access to relevant guides
Use browser search (Ctrl+F / Cmd+F) - Find specific topics within documents
Check FAQ first - Often the fastest path to answers
Cross-reference - Troubleshooting often links to Performance Tuning and Best Practices
Stay updated - Reference docs evolve with common user needs

When Reference Isn't Enough¶

If these quick references don't address your needs:

Complex workflows: See User Guides
Configuration questions: See Configuration Reference
Understanding internals: See Architecture Documentation
Custom development: See Development Guides

Quick Troubleshooting Checklist¶

Before deep debugging, verify:

Data directory structure is correct (see Data Preparation)
Configuration file is valid YAML
GPU is available and utilized (nvidia-smi)
Dependencies are installed (uv pip install -e .)
Sufficient disk space for checkpoints and logs
Correct Python and PyTorch versions
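
Several of the checks above can be scripted. The following is a minimal sketch, not part of the framework: the function names (`preflight`, `check_yaml`), the `config.yaml` path, and the default thresholds are hypothetical placeholders, and the YAML check assumes PyYAML is installed. It does not verify the data directory layout, which depends on the structure described in Data Preparation.

```python
import importlib.util
import shutil
import sys
from pathlib import Path


def check_yaml(path):
    """Return (passed, detail) for a YAML syntax check on `path`."""
    path = Path(path)
    if not path.is_file():
        return (False, f"{path} not found")
    try:
        import yaml  # PyYAML; assumed to be available, reported if not
    except ImportError:
        return (False, "PyYAML not installed; cannot parse")
    try:
        with path.open() as handle:
            yaml.safe_load(handle)
    except yaml.YAMLError as exc:
        return (False, f"invalid YAML: {exc}")
    return (True, "parsed OK")


def preflight(config_path, output_dir=".", min_free_gb=5.0, min_python=(3, 8)):
    """Run the quick checks and return {name: (passed, detail)}.

    The default thresholds are illustrative, not framework requirements.
    """
    results = {}

    # Python version of the interpreter running this script.
    results["python"] = (sys.version_info >= min_python, sys.version.split()[0])

    # Look for PyTorch without importing it.
    torch_found = importlib.util.find_spec("torch") is not None
    results["torch"] = (torch_found, "found" if torch_found else "not installed")

    # nvidia-smi on PATH suggests an NVIDIA driver is present; it does not
    # prove that PyTorch can actually use the GPU.
    smi = shutil.which("nvidia-smi")
    results["nvidia_smi"] = (smi is not None, smi or "not on PATH")

    # Free disk space where checkpoints and logs would be written.
    try:
        free_gb = shutil.disk_usage(output_dir).free / 1e9
        results["disk"] = (free_gb >= min_free_gb, f"{free_gb:.1f} GB free")
    except OSError as exc:
        results["disk"] = (False, f"cannot stat {output_dir}: {exc}")

    results["config"] = check_yaml(config_path)
    return results


if __name__ == "__main__":
    # "config.yaml" is a placeholder path.
    for name, (passed, detail) in preflight("config.yaml").items():
        print(f"[{'ok' if passed else 'FAIL'}] {name}: {detail}")
```

A failed check only narrows down where to look; it does not replace the step-by-step solutions in the troubleshooting guide.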

See Troubleshooting Guide for detailed solutions.

Performance Quick Wins¶

Common optimizations with immediate impact:

Increase batch size - If GPU memory allows (--batch_size 64)
More data workers - Speed up data loading (--num_workers 8)
Reduce image size - If resolution isn't critical (modify transforms)
Disable determinism - Faster but non-reproducible (determinism is usually off by default)
Monitor GPU usage - Ensure near 100% utilization
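
GPU utilization can also be checked from a script rather than by watching the terminal. The sketch below is an illustration only: it shells out to `nvidia-smi` with its CSV query options and parses the result. The helper names (`parse_gpu_stats`, `read_gpu_stats`, `underutilized`) and the 80% threshold are hypothetical choices, not framework APIs.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; with "nounits", memory is reported in MiB.
QUERY = "utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Convert a CSV field to int, or None for values such as '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=<QUERY> --format=csv,noheader,nounits`
    into one dict per GPU, in the order nvidia-smi lists them.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip lines that do not match the requested fields
        util, used, total = (_to_int(field) for field in fields)
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Query nvidia-smi once; return [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        completed = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_stats(completed.stdout)


def underutilized(stats, threshold_pct=80):
    """Return indices of GPUs whose utilization is known and below the threshold."""
    return [
        index
        for index, gpu in enumerate(stats)
        if gpu["util_pct"] is not None and gpu["util_pct"] < threshold_pct
    ]


if __name__ == "__main__":
    current = read_gpu_stats()
    print(current)
    print("Below threshold:", underutilized(current))
```

A single reading is only a snapshot. Sampling several times during a training epoch gives a better picture, and persistently low utilization often points to data loading as the bottleneck, which is where the batch size and worker count settings above come in.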

See Performance Tuning for comprehensive strategies.

Contributing to Reference Documentation¶

Found a common issue not covered? Have optimization tips to share?

Reference documentation grows from user experience:

Document solutions to problems you encountered
Share optimization strategies that worked
Suggest FAQ additions for repeated questions
Clarify confusing sections

Reference docs should be concise, actionable, and regularly used.


Explore other documentation sections:
- Getting Started - New user guides
- Configuration - Complete config reference
- User Guides - Practical workflows
- Architecture - System design
- Development - Extending the framework

Need help fast? Start with the FAQ or Troubleshooting.