Reference Documentation
Quick references, troubleshooting guides, and optimization resources for the PyTorch Image Classification framework.
Overview
This section provides practical reference materials designed for quick lookups during development and training. Whether you're debugging an issue, optimizing performance, or looking for best practices, these guides offer concise, actionable information to help you work efficiently.
Reference Documents
Best Practices
Essential tips and conventions for effective framework usage
Learn recommended approaches for configuration management, training workflows, hyperparameter tuning, data handling, and reproducibility. This guide helps you avoid common pitfalls and establish good habits from the start.
Key Topics:
- Configuration management and versioning
- Training workflow recommendations
- Systematic hyperparameter tuning
- Data verification and augmentation strategies
- Code extension patterns
- Reproducibility guidelines
When to use: Before starting new experiments, when establishing team conventions, or when unsure about recommended approaches.
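Here is a concrete illustration of the reproducibility guidelines: a minimal sketch of a seeding helper. The name `seed_everything` and its `deterministic` switch are hypothetical and not part of the framework's API. The helper seeds NumPy and PyTorch only when they are installed.

```python
import importlib.util
import os
import random


def seed_everything(seed: int, deterministic: bool = False) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch if they are installed (hypothetical helper)."""
    random.seed(seed)
    # Only affects subprocesses started after this point (e.g. spawned workers)
    os.environ["PYTHONHASHSEED"] = str(seed)

    if importlib.util.find_spec("numpy") is not None:
        import numpy as np

        np.random.seed(seed)

    if importlib.util.find_spec("torch") is not None:
        import torch

        torch.manual_seed(seed)  # seeds the CPU and all CUDA devices
        if deterministic:
            # Slower but repeatable: cuDNN autotuning off, deterministic kernels on.
            # Ops without a deterministic implementation will then raise an error,
            # and on CUDA, cuBLAS also needs CUBLAS_WORKSPACE_CONFIG=:4096:8 set
            # in the environment before the process starts.
            torch.backends.cudnn.benchmark = False
            torch.use_deterministic_algorithms(True)
```

Call it once at the top of a training script, for example `seed_everything(42)`, and record the seed alongside the configuration it was used with.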
Troubleshooting
Common issues and their solutions
Comprehensive troubleshooting guide covering installation problems, training issues, data errors, configuration mistakes, and inference problems. Each issue includes symptoms, causes, and step-by-step solutions.
Key Topics:
- Installation and dependency issues
- CUDA and GPU problems
- Out of memory errors
- Training failures (NaN loss, slow training, poor convergence)
- Data loading errors
- Configuration validation errors
- Checkpoint and resume issues
When to use: When encountering errors, unexpected behavior, or performance issues. Check here first before deep debugging.
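One common remedy for out of memory errors is to back off the batch size until a training step fits. The sketch below assumes a hypothetical `try_step(batch_size)` callable that runs one forward and backward pass. It is not a framework function, and the framework may handle memory errors differently.

```python
from typing import Callable


def is_oom_error(exc: BaseException) -> bool:
    """PyTorch reports CUDA OOM as a RuntimeError whose message mentions 'out of memory'."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def find_max_batch_size(
    try_step: Callable[[int], None], start: int, minimum: int = 1
) -> int:
    """Halve the batch size until one call to try_step succeeds without OOM."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as exc:
            if not is_oom_error(exc):
                raise  # a real bug, not a memory problem
            # In a PyTorch loop, also release cached blocks here: torch.cuda.empty_cache()
            batch_size //= 2
    raise RuntimeError(f"even a batch size of {minimum} does not fit in memory")
```

The returned value is a reasonable starting point for `--batch_size`. Leave some headroom, because memory use can grow during validation or with longer runs.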
Performance Tuning
Speed and memory optimization strategies
Detailed guide for optimizing training speed and reducing memory usage. Learn how to maximize GPU utilization, accelerate data loading, and train larger models within memory constraints.
Key Topics:
- Training speed optimization (batch size, data loading, determinism)
- Memory usage reduction techniques
- GPU utilization monitoring
- Model-specific optimizations
- Profiling and bottleneck identification
- Mixed precision training considerations
When to use: When training is too slow, hitting memory limits, or optimizing resource utilization for production workflows.
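A quick first step in bottleneck identification is to split wall-clock time between waiting for data and running the training step. Below is a minimal sketch. It assumes you pass in any iterable of batches (such as a `DataLoader`) and a hypothetical `step(batch)` function that performs one training step.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_loop(
    batches: Iterable, step: Callable[[object], None], max_steps: int = 50
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent inside step)."""
    data_time = compute_time = 0.0
    iterator = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        t1 = time.perf_counter()
        # With CUDA, end step() with torch.cuda.synchronize(): kernels launch
        # asynchronously, so otherwise compute time is underestimated.
        step(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    return data_time, compute_time
```

If data time dominates, more data workers usually help. If compute time dominates, look at batch size, model size, or mixed precision instead.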
FAQ
Frequently asked questions with quick answers
Quick answers to common questions organized by category. Includes general framework questions, configuration help, training workflows, data handling, model selection, and deployment topics.
Key Topics:
- Supported models and architectures
- Dataset requirements and organization
- GPU vs CPU training
- Resuming interrupted training
- Configuration overrides
- Checkpoint selection (best vs last)
- Custom model integration
- Multi-GPU training
When to use: For quick answers to common questions without reading full documentation sections.
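The best-versus-last question usually comes down to a common convention. The last checkpoint holds the most recent training state and is the one to resume from. The best checkpoint holds the weights with the best validation metric and is the one to use for inference. Below is a minimal sketch of that bookkeeping. The file names `last.pt` and `best.pt` and the `CheckpointTracker` class are hypothetical and may not match what the framework writes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckpointTracker:
    """Decide which checkpoint files to write after each validation pass (hypothetical)."""

    mode: str = "max"  # "max" for accuracy-like metrics, "min" for loss
    best_value: Optional[float] = None

    def __post_init__(self) -> None:
        if self.mode not in ("max", "min"):
            raise ValueError(f"mode must be 'max' or 'min', got {self.mode!r}")

    def update(self, metric: float) -> List[str]:
        to_save = ["last.pt"]  # always refreshed: the state to resume from
        improved = (
            self.best_value is None
            or (self.mode == "max" and metric > self.best_value)
            or (self.mode == "min" and metric < self.best_value)
        )
        if improved:
            self.best_value = metric
            to_save.append("best.pt")  # best validation score so far: use for inference
        return to_save
```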
Visualization
TensorBoard tools and ml-visualise command reference
Complete reference for the ml-visualise CLI command and TensorBoard visualization capabilities. Learn how to visualize datasets, inspect predictions, monitor training metrics, and manage TensorBoard servers.
Key Topics:
- ml-visualise CLI command modes and options
- Visualizing dataset samples
- Inspecting model predictions
- TensorBoard server management
- Log cleanup and organization
- Training metrics visualization
- Comparative experiment analysis
When to use: When setting up visualization, debugging data pipelines, analyzing model predictions, or comparing experiments.
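For log cleanup, the sketch below prunes old run directories by modification time. It is a standalone standard-library example, independent of whatever cleanup ml-visualise itself provides. It assumes each run logs to its own subdirectory under a common log root. By default it performs a dry run that only reports what would be deleted.

```python
import shutil
from pathlib import Path
from typing import List


def prune_runs(log_root: Path, keep: int = 5, dry_run: bool = True) -> List[Path]:
    """Return (and unless dry_run, delete) all but the `keep` newest run directories."""
    runs = sorted(
        (p for p in log_root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = runs[keep:]
    if not dry_run:
        for run in stale:
            shutil.rmtree(run)
    return stale
```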
Quick Access Guide
Most Commonly Needed Resources
Starting a new project? - Read: Best Practices → Configuration and Training sections
Encountering an error? - Check: Troubleshooting → Find your error message or symptom
Training too slow or out of memory? - Optimize: Performance Tuning → Speed or Memory sections
Quick question? - Search: FAQ → Organized by topic
Setting up visualization? - Reference: Visualization → ml-visualise modes
Common Scenarios
Scenario: First-Time Training
- Review Best Practices - Configuration & Training
- Set up Visualization - Launch TensorBoard
- Keep Troubleshooting handy for any issues
Scenario: Optimizing Production Workflow
- Read Performance Tuning - All sections
- Apply Best Practices - Reproducibility
- Check FAQ - Multi-GPU and deployment topics
Scenario: Debugging Training Issues
- Check Troubleshooting - Training Issues section
- Verify configuration using FAQ
- Inspect data with Visualization - Samples mode
- Review Best Practices - Data section
Scenario: Team Onboarding
- Share Best Practices for conventions
- Bookmark FAQ for quick answers
- Reference Troubleshooting for common issues
- Demo Visualization tools
Reference Quick Links
| Document | Primary Use |
|---|---|
| Best Practices | Conventions & recommendations |
| Troubleshooting | Error resolution |
| Performance Tuning | Speed & memory optimization |
| FAQ | Quick answers |
| Visualization | TensorBoard & ml-visualise |
Integration with Other Documentation
Related Documentation Sections
For comprehensive guides:
- User Guides - Complete workflows and how-tos
- Configuration Reference - All configuration options
For system understanding:
- Architecture - System design and code structure
- Development - Extending the framework
For getting started:
- Getting Started - Installation and quick start
Documentation Tips
Effective Reference Usage
- Bookmark frequently used sections - Keep quick access to relevant guides
- Use browser search (Ctrl+F / Cmd+F) - Find specific topics within documents
- Check FAQ first - Often the fastest path to answers
- Cross-reference - Troubleshooting often links to Performance Tuning and Best Practices
- Stay updated - Reference docs evolve with common user needs
When Reference Isn't Enough
If these quick references don't address your needs:
- Complex workflows: See User Guides
- Configuration questions: See Configuration Reference
- Understanding internals: See Architecture Documentation
- Custom development: See Development Guides
Quick Troubleshooting Checklist
Before deep debugging, verify that:
- The data directory structure is correct (see Data Preparation)
- The configuration file is valid YAML
- The GPU is available and being used (check with `nvidia-smi`)
- Dependencies are installed (`uv pip install -e .`)
- There is sufficient disk space for checkpoints and logs
- The correct Python and PyTorch versions are installed
See Troubleshooting Guide for detailed solutions.
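Several of the checklist items can be scripted. The following is a minimal sketch, not a tool shipped with the framework. It assumes a class-per-folder layout (`<root>/<split>/<class>/<image>`) with `train` and `val` splits, which may differ from the layout described in Data Preparation. It treats PyYAML and PyTorch as optional, and the paths `data` and `config.yaml` are placeholders.

```python
from __future__ import annotations

import importlib.util
import shutil
import sys
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def check_data_dir(root: Path, splits: tuple[str, ...] = ("train", "val")) -> list[str]:
    """Report layout problems, assuming root/<split>/<class_name>/<image files>."""
    problems = []
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing split directory: {split_dir}")
            continue
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
        if not class_dirs:
            problems.append(f"no class folders in {split_dir}")
        for class_dir in class_dirs:
            if not any(f.suffix.lower() in IMAGE_SUFFIXES for f in class_dir.iterdir()):
                problems.append(f"no images in {class_dir}")
    return problems


def check_config(path: Path) -> str | None:
    """Return an error message if the config is missing or not valid YAML."""
    if importlib.util.find_spec("yaml") is None:
        return "PyYAML is not installed, cannot validate the config"
    import yaml

    try:
        with open(path) as f:
            yaml.safe_load(f)
    except (OSError, yaml.YAMLError) as exc:
        return f"invalid config {path}: {exc}"
    return None


def check_gpu() -> str:
    """Describe the installed PyTorch version and whether CUDA is usable."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}: CUDA not available, training would use the CPU"
    return f"PyTorch {torch.__version__}: CUDA OK on {torch.cuda.get_device_name(0)}"


def free_disk_gb(path: Path) -> float:
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    data_root, config = Path("data"), Path("config.yaml")  # placeholder paths
    print("Python", sys.version.split()[0])
    print(check_gpu())
    print(check_config(config) or f"{config}: valid YAML")
    for problem in check_data_dir(data_root):
        print("data:", problem)
    print(f"free disk space: {free_disk_gb(Path('.')):.1f} GB")
```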
Performance Quick Wins
Common optimizations with immediate impact:
- Increase batch size - If GPU memory allows (`--batch_size 64`)
- Use more data workers - Speeds up data loading (`--num_workers 8`)
- Reduce image size - If resolution isn't critical (modify the transforms)
- Leave determinism disabled - Faster, but runs are not exactly reproducible (usually the default)
- Monitor GPU usage - Utilization should stay near 100% during training
See Performance Tuning for comprehensive strategies.
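To check GPU utilization from a script rather than by eye, you can poll `nvidia-smi` in its CSV query mode. This is a minimal sketch. The helper names are hypothetical, and `sample_gpu_stats()` requires the NVIDIA driver's `nvidia-smi` to be on the PATH.

```python
import subprocess
from typing import List, Tuple

# nvidia-smi query mode: one CSV line per GPU, no header, no unit suffixes
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> List[Tuple[int, int, int]]:
    """Turn lines like '87, 10240, 24576' into (utilization %, MiB used, MiB total)."""
    stats = []
    for line in csv_text.strip().splitlines():
        # Fields reported as "[N/A]" on some GPUs would need extra handling here
        util, used, total = (int(field) for field in line.split(","))
        stats.append((util, used, total))
    return stats


def sample_gpu_stats() -> List[Tuple[int, int, int]]:
    """Run nvidia-smi once and parse its output (raises if nvidia-smi is missing)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Call `sample_gpu_stats()` every few seconds while a training run is active. Sustained utilization well below 100% usually points at data loading rather than at the model.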
Contributing to Reference Documentation
Found a common issue not covered? Have optimization tips to share?
Reference documentation grows from user experience:
- Document solutions to problems you encountered
- Share optimization strategies that worked
- Suggest FAQ additions for repeated questions
- Clarify confusing sections
Reference docs should be concise, actionable, and regularly used.
Navigation
← Back to Main Documentation
Explore other documentation sections:
- Getting Started - New user guides
- Configuration - Complete config reference
- User Guides - Practical workflows
- Architecture - System design
- Development - Extending the framework
Need help fast? Start with FAQ or Troubleshooting →