8. Performance Optimizations

8.1 Summary of Optimizations

| Optimization | Speedup | Memory Saving | Implementation |
|---|---|---|---|
| Mixed Precision (AMP) | 1.5-2× | 30-40% | torch.amp.autocast |
| channels_last_3d | 5-15% | None | to(memory_format=...) |
| MIOpen Benchmarking | 10-30% | None | MIOPEN_FIND_MODE |
| OneCycleLR | 1.5-2× (convergence) | None | OneCycleLR scheduler |
| Persistent Workers | 5-10% | Moderate | persistent_workers=True |
| Gradient Checkpointing | None | 30-40% | use_checkpointing=True |
| DDP Static Graph | 5-10% | None | static_graph=True |
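Two table rows that are not covered later in this section are MIOpen benchmarking and OneCycleLR. A minimal sketch of enabling both (the MIOPEN_FIND_MODE value names are MIOpen-version-dependent, and the model, max_lr, and total_steps here are illustrative placeholders):

```python
import os

# MIOpen convolution auto-tuning: must be set before the first
# convolution runs. "FAST" trades tuning quality for startup time
# (check the MIOpen docs for the values your release accepts).
os.environ.setdefault("MIOPEN_FIND_MODE", "FAST")

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# OneCycleLR warms up to max_lr, then anneals; total_steps must
# match the planned number of optimizer steps.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=100
)

for _ in range(100):
    optimizer.step()   # real loop computes loss/backward first
    scheduler.step()   # advance the LR once per optimizer step
```

Note that the scheduler is stepped per batch, not per epoch, so total_steps is batches-per-epoch × epochs.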

8.2 Mixed Precision Details

FP16 vs FP32 Operations:

| Operation | Precision | Reason |
|---|---|---|
| Conv3d | FP16 | Compute-intensive, tolerant |
| Linear | FP16 | Matrix multiply |
| GroupNorm | FP32 | Numerical stability |
| Loss | FP32 | Avoid underflow |
| Gradients (internal) | FP16 | Memory |
| Gradients (optimizer) | FP32 | Accumulation stability |

ROCm AMP Considerations:

  • MI250X supports FP16 natively
  • bfloat16 is not widely supported on older ROCm releases (prefer FP16)
  • Some ops run slower in FP16 on ROCm (profile with torch.profiler / ProfilerActivity to identify them)

8.3 Memory Format

channels_last vs contiguous:

# Contiguous (NCDHW): logical shape [N, C, D, H, W]
# Memory layout: N₀C₀D₀H₀W₀, N₀C₀D₀H₀W₁, ..., N₀C₀D₀H₁W₀, ...
 
# channels_last_3d: logical shape is still [N, C, D, H, W]; only the
# strides change so that C varies fastest (NDHWC order in memory)
# Memory layout: N₀D₀H₀W₀C₀, N₀D₀H₀W₀C₁, ..., N₀D₀H₀W₁C₀, ...

Why It's Faster:

  • 3D convolution accesses channels in inner loop
  • channels_last_3d has better cache locality
  • Fewer cache misses → faster compute

When to Use:

  • ✓ 3D CNNs with many channels
  • ✓ Modern GPUs (ROCm 5.0+, CUDA 11+)
  • ✗ CPU inference (may be slower)
  • ✗ Custom CUDA kernels (may not support)
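Enabling the format is one call on the model and one on each input batch; a sketch (the layer and tensor shapes are illustrative):

```python
import torch

model = torch.nn.Conv3d(8, 16, kernel_size=3, padding=1)
# Reorders the weight storage; parameter shapes are unchanged.
model = model.to(memory_format=torch.channels_last_3d)

x = torch.randn(2, 8, 16, 16, 16)
x = x.to(memory_format=torch.channels_last_3d)

out = model(x)
# Logical shape is still NCDHW; only the strides changed.
assert x.shape == (2, 8, 16, 16, 16)
assert x.is_contiguous(memory_format=torch.channels_last_3d)
```

Because the logical shape is unchanged, the rest of the training code (losses, metrics, indexing) needs no modification.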

8.4 DataLoader Tuning

Guidelines:

# For fast GPUs (MI250X, A100)
num_workers = 8            # 4-8
pin_memory = True
persistent_workers = True
prefetch_factor = 3
 
# For slower GPUs (V100, T4)
num_workers = 4            # 2-4
pin_memory = True
persistent_workers = False
prefetch_factor = 2
 
# For CPU inference
num_workers = 0
pin_memory = False
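Wired into a DataLoader, the fast-GPU settings look like this (the TensorDataset is a stand-in for the real dataset; workers are only spawned once iteration starts):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.randn(64, 3, 8, 8, 8),        # stand-in volumes
    torch.randint(0, 2, (64,)),         # stand-in labels
)

train_loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=4,            # 4-8 for fast GPUs
    pin_memory=True,          # page-locked host memory, faster H2D copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=3,        # batches pre-loaded per worker
)
```

persistent_workers=True requires num_workers > 0, and prefetch_factor is ignored when num_workers == 0.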

Profiling:

import time

import torch

t0 = time.time()
for batch in train_loader:
    t1 = time.time()
    print(f"Data loading: {t1 - t0:.3f}s")

    # Training step
    ...

    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush queued GPU work before timing
    t2 = time.time()
    print(f"Training: {t2 - t1:.3f}s")
    t0 = t2

Goal: per-batch data loading time should stay below the training-step time, so the GPU never waits on the loader.

8.5 Compute vs Memory Tradeoffs

| Configuration | Compute | Memory | Use Case |
|---|---|---|---|
| base_channels=64, no checkpointing | Fastest | Highest | A100 80GB |
| base_channels=32, no checkpointing | Fast | High | A100 40GB, MI250X |
| base_channels=16, no checkpointing | Medium | Medium | V100 32GB |
| base_channels=32, checkpointing | Slow | Medium | V100 32GB |
| base_channels=16, checkpointing | Slower | Low | V100 16GB |
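The checkpointing rows trade compute for memory: activations inside a checkpointed block are freed after the forward pass and recomputed during backward. A minimal sketch with torch.utils.checkpoint (the two-block model here is illustrative, not the project's architecture):

```python
import torch
from torch.utils.checkpoint import checkpoint

block1 = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
block2 = torch.nn.Linear(16, 4)

x = torch.randn(8, 16, requires_grad=True)

# block1's intermediate activations are not stored; they are
# recomputed during backward, saving memory at extra compute cost.
h = checkpoint(block1, x, use_reentrant=False)
out = block2(h)
out.sum().backward()
```

The gradients are numerically identical to the non-checkpointed forward; only peak activation memory and backward-pass time change.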