Add CLIP fine-tuning pipeline for logo recognition

Implement contrastive learning with LoRA to fine-tune CLIP's vision
encoder on LogoDet-3K dataset for improved logo embedding similarity.

New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics

New scripts:
- train_clip_logo.py: Main training entry point
- export_model.py: Export to HuggingFace-compatible format

Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs

Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates

Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies

Author: Rick McEwen
Date:   2026-01-04 13:45:25 -05:00
Parent: 1551360028
Commit: 44e8b6ae7d
16 changed files with 3334 additions and 12 deletions

CLIP_FINETUNING.md (new file)
@@ -0,0 +1,266 @@
# CLIP Fine-Tuning for Logo Recognition
This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.
## Overview
The fine-tuning approach uses **contrastive learning** with **LoRA** (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.
**Goal**: Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created
### Training Module (`training/`)
| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |
### Scripts
| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |
### Configuration
| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
## Prerequisites
1. **Install dependencies**:
```bash
uv sync
```
2. **Prepare test data** (if not already done):
```bash
uv run python prepare_test_data.py
```
This creates:
- `reference_logos/` - Cropped logo images organized by category/brand
- `test_images/` - Full images for testing
- `test_data_mapping.db` - SQLite database with mappings
## Training
### Basic Training
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```
### Training with Overrides
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--learning-rate 5e-6 \
--max-epochs 30 \
--batch-size 8
```
### Resume from Checkpoint
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--resume checkpoints/epoch_10.pt
```
### Training Output
- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options
Key parameters in `configs/jetson_orin.yaml`:
```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16 # LoRA rank (0 to disable)
lora_alpha: 32 # LoRA scaling factor
freeze_layers: 12 # Freeze first N transformer layers
# Batch construction
batch_size: 16
logos_per_batch: 32 # Different logos per batch
samples_per_logo: 4 # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8 # Effective batch = 128
# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07 # InfoNCE temperature
# Early stopping
patience: 5
min_delta: 0.001
```
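The YAML maps directly onto the `TrainingConfig` dataclass in `training/config.py`; a minimal sketch of loading it programmatically (the path is illustrative):
```python
# Minimal sketch: load a YAML config and inspect the derived batch sizes.
from training.config import TrainingConfig

config = TrainingConfig.from_yaml("configs/jetson_orin.yaml")
print(config.effective_batch_size)  # batch_size * gradient_accumulation_steps = 128
print(config.samples_per_batch)     # logos_per_batch * samples_per_logo = 128
for warning in config.validate():   # e.g. split ratios not summing to 1.0
    print(f"Config warning: {warning}")
```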
## Evaluation
### Test Fine-Tuned Model
```bash
uv run python test_logo_detection.py -n 50 \
-e models/logo_detection/clip_finetuned \
--matching-method multi-ref \
--seed 42
```
### Compare with Baseline
```bash
# Baseline CLIP
uv run python test_logo_detection.py -n 50 \
-e openai/clip-vit-large-patch14 \
--matching-method multi-ref \
--seed 42
# Fine-tuned model
uv run python test_logo_detection.py -n 50 \
-e models/logo_detection/clip_finetuned \
--matching-method multi-ref \
--seed 42
```
### Expected Metrics
| Metric | Baseline CLIP | Target (Fine-tuned) |
|--------|---------------|---------------------|
| Precision | ~49% | >70% |
| Recall | ~77% | >75% |
| F1 Score | ~60% | >72% |
Training metrics to monitor:
- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
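These values are computed by `EmbeddingEvaluator` in `training/evaluation.py`; a minimal sketch of checking them on a batch of L2-normalized embeddings (the random tensors stand in for real model output):
```python
import torch
import torch.nn.functional as F
from training.evaluation import EmbeddingEvaluator

# Stand-in embeddings: 32 logos x 4 samples each, 768-dim, L2-normalized
embeddings = F.normalize(torch.randn(128, 768), dim=-1)
labels = torch.arange(32).repeat_interleave(4)

metrics = EmbeddingEvaluator().compute_metrics(embeddings, labels)
print(metrics["mean_pos_sim"], metrics["mean_neg_sim"], metrics["separation"])
```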
## Export Model
To export a checkpoint to HuggingFace format:
```bash
uv run python export_model.py \
--checkpoint checkpoints/best.pt \
--output models/logo_detection/clip_finetuned
```
With LoRA weight merging (reduces inference overhead):
```bash
uv run python export_model.py \
--checkpoint checkpoints/best.pt \
--output models/logo_detection/clip_finetuned \
--merge-lora
```
## Using Fine-Tuned Model with DetectLogosDETR
The fine-tuned model works as a drop-in replacement:
```python
from logo_detection_detr import DetectLogosDETR
# Use fine-tuned model
detector = DetectLogosDETR(
logger=logger,
embedding_model="models/logo_detection/clip_finetuned",
)
# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
logger=logger,
embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details
### Training Approach
1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos (a minimal sketch follows this list).
2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.
3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.
4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
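The implementation lives in `training/losses.py` and is not reproduced in this document; the following is an illustrative sketch of the multi-positive InfoNCE objective from step 1, assuming L2-normalizable embeddings and integer logo labels:
```python
# Illustrative multi-positive InfoNCE (not the exact code in training/losses.py).
# Positives are the other in-batch samples that share the same logo label.
import torch
import torch.nn.functional as F

def info_nce(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    z = F.normalize(embeddings, dim=-1)                   # [N, D] unit vectors
    sim = z @ z.T / temperature                           # [N, N] scaled cosine similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))       # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    has_pos = pos.any(dim=1)                              # anchors with at least one positive
    loss = -(log_prob * pos).sum(dim=1)[has_pos] / pos.sum(dim=1)[has_pos]
    return loss.mean()
```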
### Batch Construction
Each batch contains:
- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128
This ensures positive pairs (same logo) exist within each batch for contrastive learning.
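A quick illustration of the resulting label layout (numbers only; the actual batches are built by `LogoContrastiveDataset` in `training/dataset.py`):
```python
import torch

K, M = 32, 4                                   # logos per batch, samples per logo
labels = torch.arange(K).repeat_interleave(M)  # [0,0,0,0, 1,1,1,1, ...], length K*M = 128
# Every sample has exactly M-1 in-batch positives (same label, excluding itself)
positives = (labels.unsqueeze(0) == labels.unsqueeze(1)).sum(dim=1) - 1
assert bool((positives == M - 1).all())
```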
### Data Augmentation
Medium strength augmentations:
- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
## Troubleshooting
### Out of Memory
Reduce batch size and increase gradient accumulation:
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--batch-size 8 \
--gradient-accumulation-steps 16
```
### Slow Training
Ensure mixed precision is enabled:
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is default in jetson_orin.yaml
```
### No Improvement
Try adjusting:
- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit config to use `loss_type: "combined"`
### Import Error for Fine-Tuned Model
Ensure the `training/` module is in your Python path:
```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```
## Dependencies Added
The following were added to `pyproject.toml`:
```toml
peft>=0.7.0 # LoRA support
pyyaml>=6.0 # Config file parsing
torchvision>=0.20.0 # Image transforms
```

CLOUD_TRAINING.md (new file)
@@ -0,0 +1,269 @@
# Cloud GPU Training for CLIP Fine-Tuning
This document provides guidance on using cloud GPU instances (e.g., RunPod) for faster CLIP fine-tuning compared to local training on Jetson Orin AGX.
## Training Time Comparison
Local training on Jetson Orin AGX takes approximately 24 hours. Cloud GPUs offer significantly faster training:
| GPU | VRAM | Est. Training Time | Hourly Rate | Est. Total Cost |
|-----|------|-------------------|-------------|-----------------|
| **RTX 4090** | 24GB | 4-6 hours | $0.59/hr | **$2.40-$3.50** |
| **RTX 3090** | 24GB | 5-7 hours | $0.39/hr | **$2.00-$2.75** |
| **A100 80GB** | 80GB | 2-3 hours | $1.99/hr | **$4.00-$6.00** |
| **L40S** | 48GB | 3-4 hours | $0.89/hr | **$2.70-$3.60** |
| **H100 80GB** | 80GB | 1.5-2 hours | $1.99/hr | **$3.00-$4.00** |
*Prices from RunPod Community Cloud as of January 2025. Rates may vary.*
## Recommendations
### Best Value: RTX 4090 ($0.59/hr)
- 24GB VRAM is sufficient for ViT-L/14 with LoRA
- Good balance of speed and cost
- Widely available on Community Cloud
- **Total cost: ~$3 for complete training**
### Best Speed: H100 80GB ($1.99/hr)
- Fastest training (1.5-2 hours)
- 80GB VRAM allows larger batch sizes
- Can increase `batch_size` to 32+ and reduce `gradient_accumulation_steps`
- **Total cost: ~$3-4**
### Budget Option: RTX 3090 ($0.39/hr)
- Cheapest hourly rate
- 24GB VRAM works fine
- Slightly slower than 4090
- **Total cost: ~$2-3**
## Cloud-Optimized Configurations
### RTX 4090 / RTX 3090 (24GB VRAM)
Create `configs/cloud_rtx4090.yaml`:
```yaml
# Optimized for 24GB VRAM cloud GPUs
base_model: "openai/clip-vit-large-patch14"
# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"
# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15
# Larger batches for faster training
batch_size: 32
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 4 # Effective batch = 128
num_workers: 8
# Model architecture
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: true
# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true
# Loss
temperature: 0.07
loss_type: "infonce"
# Early stopping
patience: 5
min_delta: 0.001
# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 5
# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1
seed: 42
use_augmentation: true
augmentation_strength: "medium"
```
### A100 / H100 (80GB VRAM)
Create `configs/cloud_a100.yaml`:
```yaml
# Optimized for 80GB VRAM cloud GPUs (A100, H100)
base_model: "openai/clip-vit-large-patch14"
# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"
# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15
# Maximum batch sizes for 80GB VRAM
batch_size: 64
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 2 # Effective batch = 128
num_workers: 8
# Model architecture (can disable gradient checkpointing with 80GB)
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: false # Not needed with 80GB
# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true
# Loss
temperature: 0.07
loss_type: "infonce"
# Early stopping
patience: 5
min_delta: 0.001
# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 5
# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1
seed: 42
use_augmentation: true
augmentation_strength: "medium"
```
## RunPod Quick Start
### 1. Create a Pod
1. Go to [RunPod](https://www.runpod.io/)
2. Select GPU (RTX 4090 recommended)
3. Choose PyTorch template (CUDA 12.x)
4. Set volume size: 50GB (for dataset + models)
### 2. Setup Environment
```bash
# Connect via SSH or web terminal
# Install dependencies
pip install peft pyyaml torchvision transformers tqdm pillow
# Clone your repository (or upload files)
git clone <your-repo-url>
cd logo_test
# Or use runpodctl to sync files
# runpodctl send logo_test/
```
### 3. Prepare Data
If data isn't already prepared:
```bash
# This creates reference_logos/ and test_data_mapping.db
python prepare_test_data.py
```
### 4. Run Training
```bash
# For RTX 4090
python train_clip_logo.py --config configs/cloud_rtx4090.yaml
# For A100/H100
python train_clip_logo.py --config configs/cloud_a100.yaml
# Or with command-line overrides
python train_clip_logo.py --config configs/jetson_orin.yaml \
--batch-size 32 \
--gradient-accumulation-steps 4 \
--num-workers 8
```
### 5. Download Results
```bash
# Export the trained model
python export_model.py \
--checkpoint checkpoints/best.pt \
--output models/logo_detection/clip_finetuned
# Download to local machine
# Option 1: Use runpodctl
runpodctl receive models/logo_detection/clip_finetuned
# Option 2: SCP
scp -r root@<pod-ip>:/workspace/logo_test/models/logo_detection/clip_finetuned ./
# Option 3: Compress and download via web
tar -czvf clip_finetuned.tar.gz models/logo_detection/clip_finetuned
```
## Cost Optimization Tips
### Use Spot/Interruptible Instances
- Community Cloud GPUs are already cheaper
- Some providers offer spot pricing for additional savings
- Save checkpoints frequently (`save_every_n_epochs: 2`)
### Minimize Storage Costs
- RunPod charges $0.10/GB/month for container disk
- Use network volumes only if needed
- Delete pods when training completes
### Monitor Training
- Watch for early convergence (may finish before 20 epochs)
- Early stopping will save time/cost if no improvement
### Batch Training Runs
- Test configuration locally first (1-2 epochs)
- Run full training on cloud only when config is validated
## Cost Comparison Summary
| Option | Time | Cost | Best For |
|--------|------|------|----------|
| Jetson Orin (local) | ~24 hrs | Free* | No cloud dependency |
| RTX 3090 (RunPod) | ~6 hrs | ~$2.50 | Lowest cost |
| RTX 4090 (RunPod) | ~5 hrs | ~$3.00 | Best value |
| L40S (RunPod) | ~3.5 hrs | ~$3.00 | Good balance |
| A100 80GB (RunPod) | ~2.5 hrs | ~$5.00 | Large batches |
| H100 80GB (RunPod) | ~1.5 hrs | ~$3.50 | Fastest |
*Local training has electricity cost but no cloud fees.
## References
- [RunPod Pricing](https://www.runpod.io/pricing)
- [RunPod RTX 4090](https://www.runpod.io/gpu-models/rtx-4090)
- [RunPod Documentation](https://docs.runpod.io/)

configs/cloud_a100.yaml (new file)
@@ -0,0 +1,64 @@
# Training configuration optimized for cloud A100 / H100 (80GB VRAM)
#
# Usage:
# python train_clip_logo.py --config configs/cloud_a100.yaml
#
# Estimated training time: 1.5-3 hours
# Estimated cost on RunPod: ~$3-6
# Base model
base_model: "openai/clip-vit-large-patch14"
# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"
# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15
# Maximum batch sizes for 80GB VRAM
batch_size: 64
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 2 # Effective batch = 128
num_workers: 8
# Model architecture (no gradient checkpointing needed with 80GB)
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: false
# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true
# Loss
temperature: 0.07
loss_type: "infonce"
triplet_margin: 0.3
# Early stopping
patience: 5
min_delta: 0.001
# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 2 # Save more frequently for cloud
# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1
seed: 42
use_hard_negatives: false
use_augmentation: true
augmentation_strength: "medium"

configs/cloud_rtx4090.yaml (new file)
@@ -0,0 +1,64 @@
# Training configuration optimized for cloud RTX 4090 / RTX 3090 (24GB VRAM)
#
# Usage:
# python train_clip_logo.py --config configs/cloud_rtx4090.yaml
#
# Estimated training time: 4-6 hours
# Estimated cost on RunPod: ~$3
# Base model
base_model: "openai/clip-vit-large-patch14"
# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"
# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15
# Larger batches for faster training on 24GB VRAM
batch_size: 32
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 4 # Effective batch = 128
num_workers: 8
# Model architecture
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: true
# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true
# Loss
temperature: 0.07
loss_type: "infonce"
triplet_margin: 0.3
# Early stopping
patience: 5
min_delta: 0.001
# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 2 # Save more frequently for cloud
# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1
seed: 42
use_hard_negatives: false
use_augmentation: true
augmentation_strength: "medium"

configs/jetson_orin.yaml (new file)
@@ -0,0 +1,76 @@
# Training configuration optimized for Jetson Orin AGX (~64GB shared memory)
#
# Usage:
# uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# Base model
base_model: "openai/clip-vit-large-patch14"
# Dataset paths (relative to project root)
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"
# Data split ratios (logo-level split for generalization testing)
train_split: 0.7
val_split: 0.15
test_split: 0.15
# Batch construction
# - batch_size: Number of batches loaded at once (keep low for memory)
# - logos_per_batch: Different logo classes per contrastive batch
# - samples_per_logo: Samples of each logo (creates positive pairs)
# - Effective samples per step = logos_per_batch * samples_per_logo = 128
batch_size: 16
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 8 # Effective batch = 128
num_workers: 4
# Model architecture
# LoRA enables memory-efficient fine-tuning by training low-rank adapters
# instead of full model weights
lora_r: 16 # LoRA rank (0 to disable)
lora_alpha: 32 # LoRA scaling factor
lora_dropout: 0.1 # Dropout in LoRA layers
freeze_layers: 12 # Freeze first 12 of 24 transformer layers
use_gradient_checkpointing: true # Trade compute for memory
# Training hyperparameters
learning_rate: 1.0e-5 # Conservative LR for fine-tuning
weight_decay: 0.01 # L2 regularization
warmup_steps: 500 # LR warmup steps
max_epochs: 20 # Maximum training epochs
mixed_precision: true # FP16 training for memory efficiency
# Loss function
# InfoNCE is the contrastive loss used in CLIP training
temperature: 0.07 # Similarity scaling (0.05-0.1 typical)
loss_type: "infonce" # Options: infonce, supcon, triplet, combined
triplet_margin: 0.3 # Only used if loss_type is triplet
# Early stopping
patience: 5 # Stop if no improvement for N epochs
min_delta: 0.001 # Minimum improvement threshold
# Checkpoints and output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 5
# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1
# Reproducibility
seed: 42
# Hard negative mining (advanced)
# Enable after initial training epochs for harder examples
use_hard_negatives: false
hard_negative_start_epoch: 5
hard_negatives_per_logo: 10
# Data augmentation
use_augmentation: true
augmentation_strength: "medium" # light, medium, or strong

export_model.py (new file)
@@ -0,0 +1,169 @@
#!/usr/bin/env python3
"""
Export a trained CLIP model to HuggingFace-compatible format.
This script converts a training checkpoint to a format that can be
loaded by DetectLogosDETR for inference.
Usage:
uv run python export_model.py \
--checkpoint checkpoints/best.pt \
--output models/logo_detection/clip_finetuned
# With custom base model
uv run python export_model.py \
--checkpoint checkpoints/best.pt \
--output models/logo_detection/clip_finetuned \
--base-model openai/clip-vit-large-patch14
"""
import argparse
import json
import logging
import sys
from pathlib import Path
import torch
from training.config import TrainingConfig
from training.model import create_model, LogoFineTunedCLIP
def setup_logging() -> logging.Logger:
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
return logging.getLogger(__name__)
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Export trained CLIP model for inference",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--checkpoint",
type=str,
required=True,
help="Path to training checkpoint (.pt file)",
)
parser.add_argument(
"--output",
type=str,
required=True,
help="Output directory for exported model",
)
parser.add_argument(
"--base-model",
type=str,
default=None,
help="Base CLIP model (reads from checkpoint config if not specified)",
)
parser.add_argument(
"--merge-lora",
action="store_true",
help="Merge LoRA weights into base model (reduces inference overhead)",
)
return parser.parse_args()
def main():
args = parse_args()
logger = setup_logging()
logger.info("CLIP Model Export")
logger.info("=" * 60)
# Check checkpoint exists
checkpoint_path = Path(args.checkpoint)
if not checkpoint_path.exists():
logger.error(f"Checkpoint not found: {checkpoint_path}")
sys.exit(1)
# Load checkpoint
logger.info(f"Loading checkpoint: {checkpoint_path}")
checkpoint = torch.load(checkpoint_path, map_location="cpu")
# Get config from checkpoint
if "config" in checkpoint:
config_dict = checkpoint["config"]
base_model = args.base_model or config_dict.get(
"base_model", "openai/clip-vit-large-patch14"
)
lora_r = config_dict.get("lora_r", 16)
lora_alpha = config_dict.get("lora_alpha", 32)
freeze_layers = config_dict.get("freeze_layers", 12)
else:
base_model = args.base_model or "openai/clip-vit-large-patch14"
lora_r = 16
lora_alpha = 32
freeze_layers = 12
logger.info(f"Base model: {base_model}")
logger.info(f"LoRA rank: {lora_r}")
logger.info(f"Freeze layers: {freeze_layers}")
# Create model with same architecture
logger.info("Creating model architecture...")
model, processor = create_model(
base_model=base_model,
lora_r=lora_r,
lora_alpha=lora_alpha,
freeze_layers=freeze_layers,
use_gradient_checkpointing=False, # Not needed for export
)
# Load weights
logger.info("Loading trained weights...")
model.load_state_dict(checkpoint["model_state_dict"])
# Merge LoRA if requested
if args.merge_lora and model.peft_applied:
try:
logger.info("Merging LoRA weights into base model...")
model.vision_model = model.vision_model.merge_and_unload()
model.peft_applied = False
model.lora_r = 0
logger.info("LoRA weights merged successfully")
except Exception as e:
logger.warning(f"Could not merge LoRA weights: {e}")
logger.warning("Exporting with separate LoRA weights")
# Create output directory
output_path = Path(args.output)
output_path.mkdir(parents=True, exist_ok=True)
# Save model
logger.info(f"Exporting to: {output_path}")
model.save_pretrained(str(output_path))
# Save processor config for reference
processor.save_pretrained(str(output_path / "processor"))
# Save additional metadata
metadata = {
"base_model": base_model,
"source_checkpoint": str(checkpoint_path),
"training_epochs": checkpoint.get("epoch", -1) + 1,
"best_val_loss": checkpoint.get("best_val_loss", None),
"best_val_separation": checkpoint.get("best_val_separation", None),
"lora_merged": args.merge_lora and not model.peft_applied,
}
with open(output_path / "export_metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
logger.info("\nExport complete!")
logger.info(f"Model saved to: {output_path}")
logger.info("\nTo use with DetectLogosDETR:")
logger.info(f" detector = DetectLogosDETR(embedding_model='{output_path}')")
logger.info("\nOr with test_logo_detection.py:")
logger.info(f" uv run python test_logo_detection.py -e {output_path}")
if __name__ == "__main__":
main()

logo_detection_detr.py (modified)
@@ -13,6 +13,7 @@ Supported embedding models:
- DINOv2 models (facebook/dinov2-*): Self-supervised, excellent for visual similarity
"""
import json
import os
import torch
import torch.nn.functional as F
@@ -100,16 +101,20 @@ class DetectLogosDETR:
embedding_model, default_embedding_dir, "Embedding"
)
# Detect model type and initialize accordingly
self.model_type = self._detect_model_type(embedding_model)
self.logger.info(f"Loading {self.model_type} embedding model: {embedding_model_path}")
# Check if this is a fine-tuned model
if self._is_finetuned_model(embedding_model_path):
self._load_finetuned_embedding_model(embedding_model_path)
else:
# Detect model type and initialize accordingly
self.model_type = self._detect_model_type(embedding_model)
self.logger.info(f"Loading {self.model_type} embedding model: {embedding_model_path}")
if self.model_type == "clip":
self.embedding_model = CLIPModel.from_pretrained(embedding_model_path).to(self.device)
self.embedding_processor = CLIPProcessor.from_pretrained(embedding_model_path)
else: # dinov2 or other transformer models
self.embedding_model = AutoModel.from_pretrained(embedding_model_path).to(self.device)
self.embedding_processor = AutoImageProcessor.from_pretrained(embedding_model_path)
if self.model_type == "clip":
self.embedding_model = CLIPModel.from_pretrained(embedding_model_path).to(self.device)
self.embedding_processor = CLIPProcessor.from_pretrained(embedding_model_path)
else: # dinov2 or other transformer models
self.embedding_model = AutoModel.from_pretrained(embedding_model_path).to(self.device)
self.embedding_processor = AutoImageProcessor.from_pretrained(embedding_model_path)
self.logger.info("DetectLogosDETR initialization complete")
@@ -124,6 +129,62 @@ class DetectLogosDETR:
# Default to generic transformer for unknown models
return "transformer"
def _is_finetuned_model(self, model_path: str) -> bool:
"""Check if a model path points to a fine-tuned CLIP model."""
config_path = Path(model_path) / "config.json"
if config_path.exists():
try:
with open(config_path, "r") as f:
config = json.load(f)
return config.get("model_type") == "clip_logo_finetuned"
except (json.JSONDecodeError, IOError):
pass
return False
def _load_finetuned_embedding_model(self, model_path: str) -> None:
"""
Load a fine-tuned CLIP model from the training module.
Args:
model_path: Path to the fine-tuned model directory
"""
# Import the fine-tuned model class
try:
from training.model import LogoFineTunedCLIP
except ImportError as e:
self.logger.error(
f"Cannot import training.model for fine-tuned model: {e}"
)
raise ImportError(
"Fine-tuned model requires the training module. "
"Ensure the training/ directory is in your Python path."
) from e
# Load config
config_path = Path(model_path) / "config.json"
with open(config_path, "r") as f:
config = json.load(f)
base_model = config.get("base_model", "openai/clip-vit-large-patch14")
self.logger.info(f"Loading fine-tuned CLIP model from: {model_path}")
self.logger.info(f" Base model: {base_model}")
# Load model using the from_pretrained method
self.embedding_model = LogoFineTunedCLIP.from_pretrained(
model_path,
base_model=base_model,
device=self.device,
)
self.embedding_model.eval()
# Load processor from base model
self.embedding_processor = CLIPProcessor.from_pretrained(base_model)
# Set model type for embedding extraction
self.model_type = "clip_finetuned"
self.logger.info("Fine-tuned CLIP model loaded successfully")
def _resolve_model_path(
self, model_name_or_path: str, default_local_dir: str, model_type: str
) -> str:
@@ -345,7 +406,7 @@ class DetectLogosDETR:
"""
Internal method to get embedding from PIL image.
Handles both CLIP and DINOv2 model types.
Handles CLIP, fine-tuned CLIP, and DINOv2 model types.
Args:
pil_image: PIL Image (RGB format)
@@ -360,6 +421,9 @@ class DetectLogosDETR:
if self.model_type == "clip":
# CLIP has a dedicated method for image features
features = self.embedding_model.get_image_features(**inputs)
elif self.model_type == "clip_finetuned":
# Fine-tuned CLIP uses get_image_features or forward with pixel_values
features = self.embedding_model.get_image_features(**inputs)
else:
# DINOv2 and other transformers use the CLS token or pooled output
outputs = self.embedding_model(**inputs)
@@ -370,8 +434,9 @@ class DetectLogosDETR:
# Use CLS token from last_hidden_state
features = outputs.last_hidden_state[:, 0, :]
# Normalize for cosine similarity
features = F.normalize(features, dim=-1)
# Normalize for cosine similarity (fine-tuned model already normalizes)
if self.model_type != "clip_finetuned":
features = F.normalize(features, dim=-1)
return features

pyproject.toml (modified)
@@ -12,4 +12,7 @@ dependencies = [
"tqdm>=4.67.1",
"transformers>=4.57.3",
"typing>=3.10.0.0",
"peft>=0.7.0",
"pyyaml>=6.0",
"torchvision>=0.20.0",
]

train_clip_logo.py (new file)
@@ -0,0 +1,309 @@
#!/usr/bin/env python3
"""
Fine-tune CLIP vision encoder for logo recognition.
This script trains a CLIP model using contrastive learning on the LogoDet-3K
dataset to improve logo embedding quality for similarity-based matching.
Usage:
# Train with YAML config
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# Train with command-line overrides
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--learning-rate 5e-6 --max-epochs 30
# Resume from checkpoint
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--resume checkpoints/epoch_10.pt
"""
import argparse
import logging
import random
import sys
from pathlib import Path
import numpy as np
import torch
from training.config import TrainingConfig
from training.dataset import create_dataloaders
from training.model import create_model
from training.trainer import Trainer
def setup_logging(log_level: str = "INFO") -> logging.Logger:
"""Configure logging."""
logging.basicConfig(
level=getattr(logging, log_level.upper()),
format="%(asctime)s [%(levelname)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
return logging.getLogger(__name__)
def set_seed(seed: int) -> None:
"""Set random seeds for reproducibility."""
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
def parse_args() -> argparse.Namespace:
"""Parse command-line arguments."""
parser = argparse.ArgumentParser(
description="Fine-tune CLIP for logo recognition",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
# Config file
parser.add_argument(
"--config",
type=str,
help="Path to YAML configuration file",
)
# Dataset paths
parser.add_argument(
"--dataset-dir",
type=str,
help="Path to LogoDet-3K dataset",
)
parser.add_argument(
"--reference-dir",
type=str,
help="Path to reference logos directory",
)
parser.add_argument(
"--db-path",
type=str,
help="Path to SQLite database",
)
# Model
parser.add_argument(
"--base-model",
type=str,
help="Base CLIP model name or path",
)
parser.add_argument(
"--lora-r",
type=int,
help="LoRA rank (0 to disable)",
)
parser.add_argument(
"--freeze-layers",
type=int,
help="Number of transformer layers to freeze",
)
# Training
parser.add_argument(
"--batch-size",
type=int,
help="Batch size",
)
parser.add_argument(
"--learning-rate",
type=float,
help="Learning rate",
)
parser.add_argument(
"--max-epochs",
type=int,
help="Maximum number of epochs",
)
parser.add_argument(
"--gradient-accumulation-steps",
type=int,
help="Gradient accumulation steps",
)
# Loss
parser.add_argument(
"--temperature",
type=float,
help="Temperature for InfoNCE loss",
)
parser.add_argument(
"--loss-type",
choices=["infonce", "supcon", "triplet", "combined"],
help="Loss function type",
)
# Checkpointing
parser.add_argument(
"--checkpoint-dir",
type=str,
help="Directory for checkpoints",
)
parser.add_argument(
"--output-dir",
type=str,
help="Directory for final model output",
)
parser.add_argument(
"--resume",
type=str,
help="Path to checkpoint to resume from",
)
# Other
parser.add_argument(
"--seed",
type=int,
help="Random seed",
)
parser.add_argument(
"--log-level",
type=str,
default="INFO",
choices=["DEBUG", "INFO", "WARNING", "ERROR"],
help="Logging level",
)
parser.add_argument(
"--no-mixed-precision",
action="store_true",
help="Disable mixed precision training",
)
return parser.parse_args()
def main():
"""Main training entry point."""
args = parse_args()
# Setup logging
logger = setup_logging(args.log_level)
logger.info("CLIP Logo Fine-Tuning")
logger.info("=" * 60)
# Load or create configuration
if args.config:
logger.info(f"Loading config from: {args.config}")
config = TrainingConfig.from_yaml(args.config)
else:
logger.info("Using default configuration")
config = TrainingConfig()
# Apply command-line overrides
override_fields = [
"dataset_dir", "reference_dir", "db_path", "base_model",
"lora_r", "freeze_layers", "batch_size", "learning_rate",
"max_epochs", "gradient_accumulation_steps", "temperature",
"loss_type", "checkpoint_dir", "output_dir", "seed",
]
for field in override_fields:
arg_value = getattr(args, field, None)
if arg_value is not None:
setattr(config, field, arg_value)
logger.info(f"Override: {field} = {arg_value}")
if args.no_mixed_precision:
config.mixed_precision = False
logger.info("Override: mixed_precision = False")
# Validate configuration
warnings = config.validate()
for warning in warnings:
logger.warning(f"Config warning: {warning}")
# Set random seed
set_seed(config.seed)
logger.info(f"Random seed: {config.seed}")
# Check paths exist
db_path = Path(config.db_path)
ref_dir = Path(config.reference_dir)
if not db_path.exists():
logger.error(f"Database not found: {db_path}")
logger.error("Run prepare_test_data.py first to create the database.")
sys.exit(1)
if not ref_dir.exists():
logger.error(f"Reference directory not found: {ref_dir}")
logger.error("Run prepare_test_data.py first to extract reference logos.")
sys.exit(1)
# Create model
logger.info(f"Creating model from: {config.base_model}")
model, processor = create_model(
base_model=config.base_model,
lora_r=config.lora_r,
lora_alpha=config.lora_alpha,
lora_dropout=config.lora_dropout,
freeze_layers=config.freeze_layers,
use_gradient_checkpointing=config.use_gradient_checkpointing,
)
# Create dataloaders
logger.info("Creating dataloaders...")
train_loader, val_loader, test_loader = create_dataloaders(
db_path=str(config.db_path),
reference_dir=str(config.reference_dir),
batch_size=config.batch_size,
logos_per_batch=config.logos_per_batch,
samples_per_logo=config.samples_per_logo,
num_workers=config.num_workers,
train_split=config.train_split,
val_split=config.val_split,
test_split=config.test_split,
seed=config.seed,
augmentation_strength=config.augmentation_strength,
)
# Create trainer
trainer = Trainer(
model=model,
train_loader=train_loader,
val_loader=val_loader,
config=config,
logger=logger,
)
# Resume from checkpoint if specified
if args.resume:
resume_path = Path(args.resume)
if resume_path.exists():
logger.info(f"Resuming from: {resume_path}")
# Set checkpoint dir to resume path's parent
if resume_path.is_file():
config.checkpoint_dir = str(resume_path.parent)
trainer.load_checkpoint(resume_path.name)
else:
logger.warning(f"Resume checkpoint not found: {resume_path}")
# Train
logger.info("\nStarting training...")
final_metrics = trainer.train()
logger.info("\nTraining complete!")
logger.info(f" Best val loss: {final_metrics['best_val_loss']:.4f}")
logger.info(f" Best separation: {final_metrics['best_val_separation']:.4f}")
logger.info(f" Total epochs: {final_metrics['total_epochs']}")
logger.info(f" Total time: {final_metrics['total_time_minutes']:.1f} minutes")
# Export model
output_path = trainer.export_model()
logger.info(f"\nModel exported to: {output_path}")
# Print next steps
logger.info("\n" + "=" * 60)
logger.info("Next steps:")
logger.info(f"1. Test the fine-tuned model:")
logger.info(f" uv run python test_logo_detection.py -n 50 \\")
logger.info(f" -e {output_path} --matching-method multi-ref")
logger.info(f"")
logger.info(f"2. Compare with baseline:")
logger.info(f" uv run python test_logo_detection.py -n 50 \\")
logger.info(f" -e openai/clip-vit-large-patch14 --matching-method multi-ref")
if __name__ == "__main__":
main()

training/__init__.py (new file)
@@ -0,0 +1,24 @@
"""
CLIP fine-tuning module for logo recognition.
This module provides tools for fine-tuning CLIP's vision encoder using
contrastive learning on the LogoDet-3K dataset.
"""
from .config import TrainingConfig
from .dataset import LogoContrastiveDataset, create_dataloaders
from .model import LogoFineTunedCLIP
from .losses import InfoNCELoss, TripletLoss
from .trainer import Trainer
from .evaluation import EmbeddingEvaluator
__all__ = [
"TrainingConfig",
"LogoContrastiveDataset",
"create_dataloaders",
"LogoFineTunedCLIP",
"InfoNCELoss",
"TripletLoss",
"Trainer",
"EmbeddingEvaluator",
]

training/config.py (new file)
@@ -0,0 +1,141 @@
"""
Training configuration for CLIP fine-tuning.
"""
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional
import yaml
@dataclass
class TrainingConfig:
"""Configuration for CLIP logo fine-tuning."""
# Base model
base_model: str = "openai/clip-vit-large-patch14"
# Dataset paths
dataset_dir: str = "LogoDet-3K"
reference_dir: str = "reference_logos"
db_path: str = "test_data_mapping.db"
# Data split ratios
train_split: float = 0.7
val_split: float = 0.15
test_split: float = 0.15
# Batch construction
batch_size: int = 16
logos_per_batch: int = 32
samples_per_logo: int = 4
gradient_accumulation_steps: int = 8
num_workers: int = 4
# Model architecture
lora_r: int = 16
lora_alpha: int = 32
lora_dropout: float = 0.1
freeze_layers: int = 12
use_gradient_checkpointing: bool = True
# Training hyperparameters
learning_rate: float = 1e-5
weight_decay: float = 0.01
warmup_steps: int = 500
max_epochs: int = 20
mixed_precision: bool = True
# Loss function
temperature: float = 0.07
loss_type: str = "infonce" # "infonce", "supcon", "triplet", or "combined"
triplet_margin: float = 0.3
# Early stopping
patience: int = 5
min_delta: float = 0.001
# Checkpoints and output
checkpoint_dir: str = "checkpoints"
output_dir: str = "models/logo_detection/clip_finetuned"
save_every_n_epochs: int = 5
# Logging
log_every_n_steps: int = 10
eval_every_n_epochs: int = 1
# Random seed for reproducibility
seed: int = 42
# Hard negative mining
use_hard_negatives: bool = False
hard_negative_start_epoch: int = 5
hard_negatives_per_logo: int = 10
# Data augmentation
use_augmentation: bool = True
augmentation_strength: str = "medium" # "light", "medium", "strong"
@classmethod
def from_yaml(cls, yaml_path: str) -> "TrainingConfig":
"""Load configuration from YAML file."""
with open(yaml_path, "r") as f:
config_dict = yaml.safe_load(f)
return cls(**config_dict)
def to_yaml(self, yaml_path: str) -> None:
"""Save configuration to YAML file."""
Path(yaml_path).parent.mkdir(parents=True, exist_ok=True)
with open(yaml_path, "w") as f:
yaml.dump(self.__dict__, f, default_flow_style=False, sort_keys=False)
def validate(self) -> List[str]:
"""Validate configuration and return list of warnings."""
warnings = []
# Check split ratios
total_split = self.train_split + self.val_split + self.test_split
if abs(total_split - 1.0) > 0.01:
warnings.append(
f"Split ratios sum to {total_split}, expected 1.0"
)
# Check batch construction
effective_batch = self.batch_size * self.gradient_accumulation_steps
if effective_batch < 64:
warnings.append(
f"Effective batch size ({effective_batch}) is small for contrastive learning. "
"Consider increasing batch_size or gradient_accumulation_steps."
)
# Check LoRA config
if self.lora_r > 0 and self.lora_alpha < self.lora_r:
warnings.append(
f"lora_alpha ({self.lora_alpha}) < lora_r ({self.lora_r}). "
"This may reduce LoRA effectiveness."
)
# Check freeze layers
if self.freeze_layers < 0:
warnings.append("freeze_layers should be >= 0")
# Check temperature
if self.temperature <= 0:
warnings.append("temperature must be positive")
elif self.temperature > 1.0:
warnings.append(
f"temperature ({self.temperature}) is high. "
"Typical values are 0.05-0.1."
)
return warnings
@property
def effective_batch_size(self) -> int:
"""Calculate effective batch size with gradient accumulation."""
return self.batch_size * self.gradient_accumulation_steps
@property
def samples_per_batch(self) -> int:
"""Total samples in one batch (logos_per_batch * samples_per_logo)."""
return self.logos_per_batch * self.samples_per_logo

training/dataset.py (new file)
@@ -0,0 +1,467 @@
"""
Dataset classes for contrastive learning on logo images.
"""
import random
import sqlite3
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader, Sampler
from torchvision import transforms
# CLIP normalization values
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]
def get_train_transforms(strength: str = "medium") -> transforms.Compose:
"""
Get training data augmentation transforms.
Args:
strength: Augmentation strength - "light", "medium", or "strong"
Returns:
Composed transforms for training
"""
if strength == "light":
return transforms.Compose([
transforms.Resize((224, 224)),
transforms.RandomHorizontalFlip(p=0.5),
transforms.ColorJitter(brightness=0.1, contrast=0.1),
transforms.ToTensor(),
transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])
elif strength == "medium":
return transforms.Compose([
transforms.Resize((224, 224)),
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomRotation(degrees=15),
transforms.ColorJitter(
brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05
),
transforms.RandomAffine(
degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)
),
transforms.RandomGrayscale(p=0.1),
transforms.ToTensor(),
transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])
else: # strong
return transforms.Compose([
transforms.Resize((256, 256)),
transforms.RandomCrop(224),
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomVerticalFlip(p=0.1),
transforms.RandomRotation(degrees=30),
transforms.ColorJitter(
brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1
),
transforms.RandomAffine(
degrees=0, translate=(0.15, 0.15), scale=(0.8, 1.2), shear=10
),
transforms.RandomGrayscale(p=0.2),
transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
transforms.ToTensor(),
transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])
def get_val_transforms() -> transforms.Compose:
"""Get validation/test transforms (no augmentation)."""
return transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])
class LogoDataset:
"""
Manages logo data from the SQLite database.
Handles loading logo-to-image mappings and splitting by logo brand.
"""
def __init__(
self,
db_path: str,
reference_dir: str,
train_split: float = 0.7,
val_split: float = 0.15,
test_split: float = 0.15,
seed: int = 42,
):
self.db_path = Path(db_path)
self.reference_dir = Path(reference_dir)
self.seed = seed
# Load logo-to-images mapping from database
self.logo_to_images = self._load_logo_mappings()
self.all_logos = list(self.logo_to_images.keys())
# Create logo-level splits
self.train_logos, self.val_logos, self.test_logos = self._split_logos(
train_split, val_split, test_split
)
def _load_logo_mappings(self) -> Dict[str, List[Path]]:
"""Load logo name to image paths mapping from database."""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT ln.name, rl.filename
FROM reference_logos rl
JOIN logo_names ln ON rl.logo_name_id = ln.id
ORDER BY ln.name
""")
logo_to_images: Dict[str, List[Path]] = {}
for logo_name, filename in cursor.fetchall():
if logo_name not in logo_to_images:
logo_to_images[logo_name] = []
logo_to_images[logo_name].append(self.reference_dir / filename)
conn.close()
return logo_to_images
def _split_logos(
self,
train_split: float,
val_split: float,
test_split: float,
) -> Tuple[List[str], List[str], List[str]]:
"""Split logos at brand level for train/val/test."""
random.seed(self.seed)
logos = self.all_logos.copy()
random.shuffle(logos)
n = len(logos)
train_end = int(n * train_split)
val_end = train_end + int(n * val_split)
train_logos = logos[:train_end]
val_logos = logos[train_end:val_end]
test_logos = logos[val_end:]
return train_logos, val_logos, test_logos
def get_split_info(self) -> Dict[str, int]:
"""Return information about the splits."""
return {
"total_logos": len(self.all_logos),
"train_logos": len(self.train_logos),
"val_logos": len(self.val_logos),
"test_logos": len(self.test_logos),
"train_images": sum(
len(self.logo_to_images[l]) for l in self.train_logos
),
"val_images": sum(
len(self.logo_to_images[l]) for l in self.val_logos
),
"test_images": sum(
len(self.logo_to_images[l]) for l in self.test_logos
),
}
class LogoContrastiveDataset(Dataset):
"""
Dataset for contrastive learning on logos.
Each __getitem__ call returns a batch of images organized for contrastive
learning: K different logos with M samples each, ensuring positive pairs
exist within each batch.
"""
def __init__(
self,
logo_data: LogoDataset,
split: str = "train",
logos_per_batch: int = 32,
samples_per_logo: int = 4,
transform: Optional[transforms.Compose] = None,
batches_per_epoch: int = 1000,
):
"""
Initialize the contrastive dataset.
Args:
logo_data: LogoDataset instance with logo mappings
split: One of "train", "val", or "test"
logos_per_batch: Number of different logos per batch
samples_per_logo: Number of samples for each logo
transform: Image transforms to apply
batches_per_epoch: Number of batches per epoch
"""
self.logo_data = logo_data
self.logos_per_batch = logos_per_batch
self.samples_per_logo = samples_per_logo
self.transform = transform
self.batches_per_epoch = batches_per_epoch
# Get logos for this split
if split == "train":
self.logos = logo_data.train_logos
elif split == "val":
self.logos = logo_data.val_logos
else:
self.logos = logo_data.test_logos
# Filter logos with enough samples
self.valid_logos = [
logo for logo in self.logos
if len(logo_data.logo_to_images[logo]) >= samples_per_logo
]
# For logos with fewer samples, we'll use with replacement
self.logos_needing_replacement = [
logo for logo in self.logos
if len(logo_data.logo_to_images[logo]) < samples_per_logo
]
# Create label mapping
self.logo_to_label = {
logo: idx for idx, logo in enumerate(self.logos)
}
def __len__(self) -> int:
return self.batches_per_epoch
def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Get a batch of images for contrastive learning.
Returns:
images: Tensor of shape [K*M, 3, 224, 224]
labels: Tensor of shape [K*M] with logo class indices
"""
images = []
labels = []
# Sample K logos for this batch
k = min(self.logos_per_batch, len(self.logos))
batch_logos = random.sample(self.logos, k)
for logo in batch_logos:
logo_images = self.logo_data.logo_to_images[logo]
# Sample M images for this logo
if len(logo_images) >= self.samples_per_logo:
sampled_paths = random.sample(logo_images, self.samples_per_logo)
else:
# Sample with replacement if not enough images
sampled_paths = random.choices(
logo_images, k=self.samples_per_logo
)
# Load and transform images
for img_path in sampled_paths:
try:
img = Image.open(img_path).convert("RGB")
if self.transform:
img = self.transform(img)
else:
img = get_val_transforms()(img)
images.append(img)
labels.append(self.logo_to_label[logo])
except Exception:
# Skip images that fail to load or transform
continue
# Stack into tensors
if len(images) == 0:
# Fallback: return dummy batch
return (
torch.zeros(1, 3, 224, 224),
torch.zeros(1, dtype=torch.long),
)
images_tensor = torch.stack(images)
labels_tensor = torch.tensor(labels, dtype=torch.long)
return images_tensor, labels_tensor
class BalancedBatchSampler(Sampler):
"""
Sampler that ensures each batch has a balanced distribution of logos.
Used with a flattened dataset where each sample is a single image.
"""
def __init__(
self,
logo_labels: List[int],
logos_per_batch: int,
samples_per_logo: int,
num_batches: int,
):
self.logo_labels = logo_labels
self.logos_per_batch = logos_per_batch
self.samples_per_logo = samples_per_logo
self.num_batches = num_batches
# Group indices by logo
self.logo_to_indices: Dict[int, List[int]] = {}
for idx, label in enumerate(logo_labels):
if label not in self.logo_to_indices:
self.logo_to_indices[label] = []
self.logo_to_indices[label].append(idx)
self.all_logos = list(self.logo_to_indices.keys())
def __iter__(self):
for _ in range(self.num_batches):
batch_indices = []
# Sample logos for this batch
logos = random.sample(
self.all_logos,
min(self.logos_per_batch, len(self.all_logos)),
)
for logo in logos:
indices = self.logo_to_indices[logo]
if len(indices) >= self.samples_per_logo:
sampled = random.sample(indices, self.samples_per_logo)
else:
sampled = random.choices(indices, k=self.samples_per_logo)
batch_indices.extend(sampled)
yield batch_indices
def __len__(self):
return self.num_batches
def create_dataloaders(
db_path: str,
reference_dir: str,
batch_size: int = 16,
logos_per_batch: int = 32,
samples_per_logo: int = 4,
num_workers: int = 4,
train_split: float = 0.7,
val_split: float = 0.15,
test_split: float = 0.15,
seed: int = 42,
augmentation_strength: str = "medium",
batches_per_epoch: int = 1000,
) -> Tuple[DataLoader, DataLoader, Optional[DataLoader]]:
"""
Create train, validation, and optionally test dataloaders.
Args:
db_path: Path to SQLite database
reference_dir: Directory containing reference logo images
batch_size: Not used directly (see logos_per_batch and samples_per_logo)
logos_per_batch: Number of different logos per batch
samples_per_logo: Samples per logo in batch
num_workers: Number of data loading workers
train_split: Fraction for training
val_split: Fraction for validation
test_split: Fraction for testing
seed: Random seed
augmentation_strength: "light", "medium", or "strong"
batches_per_epoch: Number of batches per training epoch
Returns:
Tuple of (train_loader, val_loader, test_loader)
"""
# Load logo data
logo_data = LogoDataset(
db_path=db_path,
reference_dir=reference_dir,
train_split=train_split,
val_split=val_split,
test_split=test_split,
seed=seed,
)
# Print split info
split_info = logo_data.get_split_info()
print(f"Dataset loaded:")
print(f" Total logos: {split_info['total_logos']}")
print(f" Train: {split_info['train_logos']} logos, {split_info['train_images']} images")
print(f" Val: {split_info['val_logos']} logos, {split_info['val_images']} images")
print(f" Test: {split_info['test_logos']} logos, {split_info['test_images']} images")
# Create datasets
train_dataset = LogoContrastiveDataset(
logo_data=logo_data,
split="train",
logos_per_batch=logos_per_batch,
samples_per_logo=samples_per_logo,
transform=get_train_transforms(augmentation_strength),
batches_per_epoch=batches_per_epoch,
)
val_dataset = LogoContrastiveDataset(
logo_data=logo_data,
split="val",
logos_per_batch=logos_per_batch,
samples_per_logo=samples_per_logo,
transform=get_val_transforms(),
batches_per_epoch=batches_per_epoch // 10, # Fewer val batches
)
test_dataset = LogoContrastiveDataset(
logo_data=logo_data,
split="test",
logos_per_batch=logos_per_batch,
samples_per_logo=samples_per_logo,
transform=get_val_transforms(),
batches_per_epoch=batches_per_epoch // 10,
) if test_split > 0 else None
# Create dataloaders
# Note: batch_size=1 because each __getitem__ already returns a batch
train_loader = DataLoader(
train_dataset,
batch_size=1,
shuffle=True,
num_workers=num_workers,
pin_memory=True,
collate_fn=_collate_contrastive_batch,
)
val_loader = DataLoader(
val_dataset,
batch_size=1,
shuffle=False,
num_workers=num_workers,
pin_memory=True,
collate_fn=_collate_contrastive_batch,
)
test_loader = None
if test_dataset is not None:
test_loader = DataLoader(
test_dataset,
batch_size=1,
shuffle=False,
num_workers=num_workers,
pin_memory=True,
collate_fn=_collate_contrastive_batch,
)
return train_loader, val_loader, test_loader
def _collate_contrastive_batch(
batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Collate function that unpacks pre-batched data.
Since LogoContrastiveDataset already returns batched data,
we just squeeze the outer dimension.
"""
images, labels = batch[0]
return images, labels

training/evaluation.py (new file)
@@ -0,0 +1,339 @@
"""
Evaluation metrics for embedding quality.
"""
from typing import Dict, List, Optional, Tuple
import torch
import torch.nn.functional as F
import numpy as np
class EmbeddingEvaluator:
"""
Evaluator for embedding quality metrics.
Computes metrics that indicate how well the embeddings
separate different logo classes.
"""
def compute_metrics(
self,
embeddings: torch.Tensor,
labels: torch.Tensor,
) -> Dict[str, float]:
"""
Compute embedding quality metrics.
Args:
embeddings: [N, D] L2-normalized embeddings
labels: [N] integer class labels
Returns:
Dict with metric names and values
"""
device = embeddings.device
batch_size = embeddings.shape[0]
if batch_size <= 1:
return {
"mean_pos_sim": 0.0,
"mean_neg_sim": 0.0,
"separation": 0.0,
"recall_at_1": 0.0,
"recall_at_5": 0.0,
}
# Compute similarity matrix
similarity = embeddings @ embeddings.T
# Create masks
labels_col = labels.unsqueeze(0)
labels_row = labels.unsqueeze(1)
positive_mask = (labels_row == labels_col).float()
negative_mask = 1 - positive_mask
# Remove diagonal from positive mask
identity = torch.eye(batch_size, device=device)
positive_mask = positive_mask - identity
# Count pairs
num_positives = positive_mask.sum()
num_negatives = negative_mask.sum()
# Mean positive similarity (excluding self)
if num_positives > 0:
pos_sims = (similarity * positive_mask).sum() / num_positives
mean_pos_sim = pos_sims.item()
else:
mean_pos_sim = 0.0
# Mean negative similarity
if num_negatives > 0:
neg_sims = (similarity * negative_mask).sum() / num_negatives
mean_neg_sim = neg_sims.item()
else:
mean_neg_sim = 0.0
# Separation: gap between positive and negative similarity
separation = mean_pos_sim - mean_neg_sim
# Recall@K metrics
recall_at_1 = self._compute_recall_at_k(similarity, labels, k=1)
recall_at_5 = self._compute_recall_at_k(similarity, labels, k=5)
return {
"mean_pos_sim": mean_pos_sim,
"mean_neg_sim": mean_neg_sim,
"separation": separation,
"recall_at_1": recall_at_1,
"recall_at_5": recall_at_5,
}
def _compute_recall_at_k(
self,
similarity: torch.Tensor,
labels: torch.Tensor,
k: int = 1,
) -> float:
"""
Compute Recall@K for nearest neighbor retrieval.
For each sample, check if the k nearest neighbors (excluding self)
contain at least one sample with the same label.
Args:
similarity: [N, N] similarity matrix
labels: [N] class labels
k: Number of neighbors to consider
Returns:
Recall@K score (0 to 1)
"""
batch_size = similarity.shape[0]
if batch_size <= 1:
return 0.0
# Mask out self-similarity
similarity = similarity.clone()
similarity.fill_diagonal_(float("-inf"))
# Get top-k indices
_, top_k_indices = similarity.topk(min(k, batch_size - 1), dim=1)
# Check if any of top-k have same label
correct = 0
for i in range(batch_size):
query_label = labels[i]
retrieved_labels = labels[top_k_indices[i]]
if (retrieved_labels == query_label).any():
correct += 1
return correct / batch_size
def compute_detailed_metrics(
self,
embeddings: torch.Tensor,
labels: torch.Tensor,
label_names: Optional[List[str]] = None,
) -> Dict:
"""
Compute detailed per-class metrics.
Args:
embeddings: [N, D] embeddings
labels: [N] class labels
label_names: Optional list of label names
Returns:
Dict with detailed metrics including per-class stats
"""
basic_metrics = self.compute_metrics(embeddings, labels)
# Per-class statistics
unique_labels = labels.unique()
per_class_stats = {}
similarity = embeddings @ embeddings.T
for label in unique_labels:
mask = labels == label
class_embeddings = embeddings[mask]
class_size = mask.sum().item()
if class_size > 1:
# Intra-class similarity
class_sim = class_embeddings @ class_embeddings.T
# Exclude diagonal
mask_diag = ~torch.eye(class_size, dtype=torch.bool, device=class_sim.device)
intra_sim = class_sim[mask_diag].mean().item()
else:
intra_sim = 1.0
# Inter-class similarity (to other classes)
other_mask = labels != label
if other_mask.any():
inter_sim = similarity[mask][:, other_mask].mean().item()
else:
inter_sim = 0.0
class_name = label_names[label.item()] if label_names else str(label.item())
per_class_stats[class_name] = {
"size": class_size,
"intra_class_sim": intra_sim,
"inter_class_sim": inter_sim,
"class_separation": intra_sim - inter_sim,
}
# Aggregate per-class stats
if per_class_stats:
separations = [s["class_separation"] for s in per_class_stats.values()]
min_separation = min(separations)
max_separation = max(separations)
std_separation = np.std(separations)
else:
min_separation = max_separation = std_separation = 0.0
return {
**basic_metrics,
"per_class": per_class_stats,
"min_class_separation": min_separation,
"max_class_separation": max_separation,
"std_class_separation": std_separation,
}
class SimilarityAnalyzer:
"""
Analyze similarity distributions for debugging and tuning.
"""
@staticmethod
def analyze_similarity_distribution(
embeddings: torch.Tensor,
labels: torch.Tensor,
) -> Dict[str, np.ndarray]:
"""
Get similarity distributions for positive and negative pairs.
Useful for choosing appropriate thresholds.
Args:
embeddings: [N, D] embeddings
labels: [N] class labels
Returns:
Dict with 'positive_sims' and 'negative_sims' arrays
"""
similarity = (embeddings @ embeddings.T).cpu().numpy()
labels_np = labels.cpu().numpy()
batch_size = len(labels_np)
positive_sims = []
negative_sims = []
for i in range(batch_size):
for j in range(i + 1, batch_size):
if labels_np[i] == labels_np[j]:
positive_sims.append(similarity[i, j])
else:
negative_sims.append(similarity[i, j])
return {
"positive_sims": np.array(positive_sims),
"negative_sims": np.array(negative_sims),
}
@staticmethod
def find_hard_pairs(
embeddings: torch.Tensor,
labels: torch.Tensor,
n_hard: int = 10,
) -> Tuple[List[Tuple[int, int, float]], List[Tuple[int, int, float]]]:
"""
Find hardest positive and negative pairs.
Hard positives: same label but low similarity
Hard negatives: different label but high similarity
Args:
embeddings: [N, D] embeddings
labels: [N] class labels
n_hard: Number of hard pairs to return
Returns:
Tuple of (hard_positives, hard_negatives)
Each is a list of (idx1, idx2, similarity) tuples
"""
similarity = embeddings @ embeddings.T
batch_size = len(labels)
hard_positives = [] # Low similarity, same label
hard_negatives = [] # High similarity, different label
for i in range(batch_size):
for j in range(i + 1, batch_size):
sim = similarity[i, j].item()
if labels[i] == labels[j]:
hard_positives.append((i, j, sim))
else:
hard_negatives.append((i, j, sim))
# Sort: hard positives by ascending similarity (lowest first)
hard_positives.sort(key=lambda x: x[2])
# Sort: hard negatives by descending similarity (highest first)
hard_negatives.sort(key=lambda x: -x[2])
return hard_positives[:n_hard], hard_negatives[:n_hard]
@staticmethod
def compute_confusion_pairs(
embeddings: torch.Tensor,
labels: torch.Tensor,
label_names: Optional[List[str]] = None,
top_k: int = 10,
) -> List[Dict]:
"""
Find pairs of classes that are most confused (highest cross-class similarity).
Args:
embeddings: [N, D] embeddings
labels: [N] class labels
label_names: Optional label names
top_k: Number of confused pairs to return
Returns:
List of dicts with class pairs and their similarity
"""
unique_labels = labels.unique()
class_centroids = {}
# Compute class centroids
for label in unique_labels:
mask = labels == label
centroid = embeddings[mask].mean(dim=0)
centroid = F.normalize(centroid, dim=0)
class_centroids[label.item()] = centroid
# Compute pairwise centroid similarities
confusions = []
label_list = list(class_centroids.keys())
for i, label1 in enumerate(label_list):
for label2 in label_list[i + 1:]:
sim = (class_centroids[label1] @ class_centroids[label2]).item()
name1 = label_names[label1] if label_names else str(label1)
name2 = label_names[label2] if label_names else str(label2)
confusions.append({
"class1": name1,
"class2": name2,
"label1": label1,
"label2": label2,
"centroid_similarity": sim,
})
# Sort by similarity (highest first)
confusions.sort(key=lambda x: -x["centroid_similarity"])
return confusions[:top_k]
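# --- Usage sketch (illustrative, not part of the module) -------------------
# A minimal example of running the evaluator and analyzer on one batch of
# embeddings. The random tensors below are placeholders, not real logo data.
if __name__ == "__main__":
    fake_embeddings = F.normalize(torch.randn(64, 768), dim=-1)
    fake_labels = torch.randint(0, 8, (64,))

    evaluator = EmbeddingEvaluator()
    metrics = evaluator.compute_metrics(fake_embeddings, fake_labels)
    print(f"separation={metrics['separation']:.3f}  R@1={metrics['recall_at_1']:.3f}")

    dist = SimilarityAnalyzer.analyze_similarity_distribution(fake_embeddings, fake_labels)
    print(f"{len(dist['positive_sims'])} positive pairs, {len(dist['negative_sims'])} negative pairs")

    hard_pos, hard_neg = SimilarityAnalyzer.find_hard_pairs(fake_embeddings, fake_labels, n_hard=5)
    print("hardest negative pair (idx1, idx2, sim):", hard_neg[0] if hard_neg else None)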

training/losses.py (new file, 326 lines)
@@ -0,0 +1,326 @@
"""
Loss functions for contrastive learning on logo embeddings.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional
class InfoNCELoss(nn.Module):
"""
Normalized Temperature-scaled Cross Entropy Loss (InfoNCE).
This is the contrastive loss used in CLIP training. It maximizes
similarity between embeddings of the same logo class while
minimizing similarity to embeddings of different classes.
For a batch with N samples:
- Each sample is an anchor
- Positive pairs: samples with the same label
- Negative pairs: samples with different labels
    The loss for each anchor is averaged over its positive pairs pos:
        -log( exp(sim(anchor, pos)/temp) / sum_{a != anchor} exp(sim(anchor, a)/temp) )
"""
def __init__(self, temperature: float = 0.07):
"""
Initialize InfoNCE loss.
Args:
temperature: Scaling factor for similarities (0.05-0.1 typical).
Lower temperature makes the distribution sharper.
"""
super().__init__()
self.temperature = temperature
def forward(
self,
embeddings: torch.Tensor,
labels: torch.Tensor,
) -> torch.Tensor:
"""
Compute InfoNCE loss for a batch of embeddings.
Args:
embeddings: [N, D] L2-normalized embeddings
labels: [N] integer logo class labels
Returns:
Scalar loss value
"""
device = embeddings.device
batch_size = embeddings.shape[0]
if batch_size <= 1:
return torch.tensor(0.0, device=device, requires_grad=True)
# Compute similarity matrix [N, N]
# Since embeddings are L2-normalized, dot product = cosine similarity
similarity = embeddings @ embeddings.T / self.temperature
# Create positive mask: same label = 1, different = 0
labels_col = labels.unsqueeze(0) # [1, N]
labels_row = labels.unsqueeze(1) # [N, 1]
positive_mask = (labels_row == labels_col).float() # [N, N]
# Remove self-similarity from positives (diagonal)
identity = torch.eye(batch_size, device=device)
positive_mask = positive_mask - identity
# Count positives per anchor (avoid division by zero)
num_positives = positive_mask.sum(dim=1)
has_positives = num_positives > 0
# If no positives exist for any anchor, return zero loss
if not has_positives.any():
return torch.tensor(0.0, device=device, requires_grad=True)
# Mask out self-similarity with large negative value
similarity = similarity - identity * 1e9
# Compute log-softmax over similarities
log_softmax = F.log_softmax(similarity, dim=1)
# Sum log probabilities of positive pairs
positive_log_probs = (log_softmax * positive_mask).sum(dim=1)
# Average over number of positives (only for anchors with positives)
loss_per_anchor = torch.zeros(batch_size, device=device)
loss_per_anchor[has_positives] = (
-positive_log_probs[has_positives] / num_positives[has_positives]
)
return loss_per_anchor.mean()
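# Worked example of the masking above, for labels [0, 0, 1, 1] (N=4):
#   positive_mask (self removed) = [[0,1,0,0],
#                                   [1,0,0,0],
#                                   [0,0,0,1],
#                                   [0,0,1,0]]
#   log_softmax is taken row-wise over all 4 candidates (self masked out with -1e9),
#   and each anchor's loss is the mean of -log p(positive | anchor) over its positives.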
class SupConLoss(nn.Module):
"""
Supervised Contrastive Loss.
Similar to InfoNCE but uses a different formulation that
considers each positive pair separately rather than averaging.
Reference: https://arxiv.org/abs/2004.11362
"""
def __init__(self, temperature: float = 0.07):
super().__init__()
self.temperature = temperature
def forward(
self,
embeddings: torch.Tensor,
labels: torch.Tensor,
) -> torch.Tensor:
"""
Compute Supervised Contrastive loss.
Args:
embeddings: [N, D] L2-normalized embeddings
labels: [N] integer logo class labels
Returns:
Scalar loss value
"""
device = embeddings.device
batch_size = embeddings.shape[0]
if batch_size <= 1:
return torch.tensor(0.0, device=device, requires_grad=True)
# Compute similarity matrix
similarity = embeddings @ embeddings.T / self.temperature
# Create masks
labels_col = labels.unsqueeze(0)
labels_row = labels.unsqueeze(1)
positive_mask = (labels_row == labels_col).float()
identity = torch.eye(batch_size, device=device)
# Remove self from positives
positive_mask = positive_mask - identity
# Number of positives per anchor
num_positives = positive_mask.sum(dim=1)
has_positives = num_positives > 0
if not has_positives.any():
return torch.tensor(0.0, device=device, requires_grad=True)
# For numerical stability, subtract max similarity
sim_max, _ = similarity.max(dim=1, keepdim=True)
similarity = similarity - sim_max.detach()
# Compute exp(similarity) with self masked out
exp_sim = torch.exp(similarity) * (1 - identity)
# Denominator: sum of exp over all pairs except self
log_prob = similarity - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-8)
# Mean of log-prob over positive pairs
mean_log_prob_pos = (positive_mask * log_prob).sum(dim=1) / (
num_positives + 1e-8
)
# Loss is negative mean log probability
loss = -mean_log_prob_pos[has_positives].mean()
return loss
class TripletLoss(nn.Module):
"""
Triplet loss with online hard mining.
For each anchor:
- Hardest positive: most distant sample with same label
- Hardest negative: closest sample with different label
Loss = max(0, d(anchor, hardest_pos) - d(anchor, hardest_neg) + margin)
This is an alternative to InfoNCE for when batch sizes are small.
"""
def __init__(self, margin: float = 0.3):
"""
Initialize Triplet loss.
Args:
margin: Minimum required gap between positive and negative distances
"""
super().__init__()
self.margin = margin
def forward(
self,
embeddings: torch.Tensor,
labels: torch.Tensor,
) -> torch.Tensor:
"""
Compute triplet loss with online hard mining.
Args:
embeddings: [N, D] L2-normalized embeddings
labels: [N] integer logo class labels
Returns:
Scalar loss value
"""
device = embeddings.device
batch_size = embeddings.shape[0]
if batch_size <= 1:
return torch.tensor(0.0, device=device, requires_grad=True)
# Compute pairwise cosine distances (1 - cosine_similarity)
# For normalized vectors: distance = 1 - dot_product
similarity = embeddings @ embeddings.T
distances = 1 - similarity
# Create masks
labels_col = labels.unsqueeze(0)
labels_row = labels.unsqueeze(1)
positive_mask = (labels_row == labels_col).float()
negative_mask = 1 - positive_mask
# Remove self from positives (diagonal)
identity = torch.eye(batch_size, device=device)
positive_mask = positive_mask - identity
# Check if we have any valid triplets
has_positives = positive_mask.sum(dim=1) > 0
has_negatives = negative_mask.sum(dim=1) > 0
valid_anchors = has_positives & has_negatives
if not valid_anchors.any():
return torch.tensor(0.0, device=device, requires_grad=True)
# For each anchor, find hardest positive (max distance among positives)
# Set negatives to -inf so they don't affect max
pos_distances = distances.clone()
pos_distances[positive_mask == 0] = float("-inf")
hardest_positive, _ = pos_distances.max(dim=1)
# For each anchor, find hardest negative (min distance among negatives)
# Set positives to inf so they don't affect min
neg_distances = distances.clone()
neg_distances[negative_mask == 0] = float("inf")
hardest_negative, _ = neg_distances.min(dim=1)
# Triplet loss: want positive to be closer than negative by margin
triplet_loss = F.relu(
hardest_positive - hardest_negative + self.margin
)
# Average over valid anchors only
loss = triplet_loss[valid_anchors].mean()
return loss
class CombinedLoss(nn.Module):
"""
Combined loss function with weighted InfoNCE and Triplet losses.
Can help stabilize training by combining the benefits of both losses.
"""
def __init__(
self,
temperature: float = 0.07,
triplet_margin: float = 0.3,
infonce_weight: float = 1.0,
triplet_weight: float = 0.5,
):
super().__init__()
self.infonce = InfoNCELoss(temperature=temperature)
self.triplet = TripletLoss(margin=triplet_margin)
self.infonce_weight = infonce_weight
self.triplet_weight = triplet_weight
def forward(
self,
embeddings: torch.Tensor,
labels: torch.Tensor,
) -> torch.Tensor:
infonce_loss = self.infonce(embeddings, labels)
triplet_loss = self.triplet(embeddings, labels)
return (
self.infonce_weight * infonce_loss +
self.triplet_weight * triplet_loss
)
def get_loss_function(
loss_type: str = "infonce",
temperature: float = 0.07,
triplet_margin: float = 0.3,
) -> nn.Module:
"""
Factory function to create loss function.
Args:
loss_type: One of "infonce", "supcon", "triplet", or "combined"
temperature: Temperature for InfoNCE/SupCon
triplet_margin: Margin for triplet loss
Returns:
Loss function module
"""
if loss_type == "infonce":
return InfoNCELoss(temperature=temperature)
elif loss_type == "supcon":
return SupConLoss(temperature=temperature)
elif loss_type == "triplet":
return TripletLoss(margin=triplet_margin)
elif loss_type == "combined":
return CombinedLoss(
temperature=temperature,
triplet_margin=triplet_margin,
)
else:
raise ValueError(f"Unknown loss type: {loss_type}")
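# --- Usage sketch (illustrative, not part of the module) -------------------
# Quick sanity check of the loss factory on a toy batch: four L2-normalized
# random embeddings from two classes, standing in for real CLIP outputs.
if __name__ == "__main__":
    embeddings = F.normalize(torch.randn(4, 768), dim=-1)
    labels = torch.tensor([0, 0, 1, 1])

    for loss_type in ("infonce", "supcon", "triplet", "combined"):
        criterion = get_loss_function(loss_type=loss_type, temperature=0.07, triplet_margin=0.3)
        loss = criterion(embeddings, labels)
        print(f"{loss_type}: {loss.item():.4f}")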

training/model.py (new file, 335 lines)
@@ -0,0 +1,335 @@
"""
Fine-tunable CLIP model wrapper with LoRA support.
"""
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor
# Check if peft is available for LoRA
try:
from peft import LoraConfig, get_peft_model, PeftModel
PEFT_AVAILABLE = True
except ImportError:
PEFT_AVAILABLE = False
LoraConfig = None
get_peft_model = None
PeftModel = None
class LogoFineTunedCLIP(nn.Module):
"""
CLIP vision encoder fine-tuned for logo similarity.
Preserves embedding interface for compatibility with DetectLogosDETR:
    - Embedding dimensionality equal to the vision encoder's hidden size (1024 for ViT-L/14)
- L2 normalized outputs
- Works with existing get_image_features() pattern
Supports:
- LoRA for memory-efficient fine-tuning
- Layer freezing for transfer learning
- Gradient checkpointing for memory optimization
"""
def __init__(
self,
vision_model: nn.Module,
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.1,
freeze_layers: int = 12,
use_gradient_checkpointing: bool = True,
add_projection_head: bool = True,
):
"""
Initialize the fine-tunable CLIP wrapper.
Args:
vision_model: CLIP vision model (CLIPVisionModel)
lora_r: Rank of LoRA low-rank matrices (0 to disable)
lora_alpha: LoRA scaling factor
lora_dropout: Dropout for LoRA layers
freeze_layers: Number of transformer layers to freeze (from bottom)
use_gradient_checkpointing: Enable gradient checkpointing
add_projection_head: Add trainable projection head
"""
super().__init__()
self.vision_model = vision_model
self.embedding_dim = vision_model.config.hidden_size
self.freeze_layers = freeze_layers
self.lora_r = lora_r
self.lora_alpha = lora_alpha
# Enable gradient checkpointing for memory efficiency
if use_gradient_checkpointing:
if hasattr(self.vision_model, "gradient_checkpointing_enable"):
self.vision_model.gradient_checkpointing_enable()
# Freeze lower layers
self._freeze_layers(freeze_layers)
# Apply LoRA to attention layers in upper blocks
self.peft_applied = False
if PEFT_AVAILABLE and lora_r > 0:
self._apply_lora(lora_r, lora_alpha, lora_dropout)
self.peft_applied = True
elif lora_r > 0 and not PEFT_AVAILABLE:
print(
"Warning: peft not installed. LoRA disabled. "
"Install with: pip install peft"
)
# Optional projection head for fine-tuning
self.add_projection_head = add_projection_head
if add_projection_head:
self.projection = nn.Sequential(
nn.Linear(self.embedding_dim, self.embedding_dim),
nn.LayerNorm(self.embedding_dim),
)
else:
self.projection = nn.Identity()
def _freeze_layers(self, num_layers: int) -> None:
"""Freeze the first N transformer layers and embeddings."""
if num_layers <= 0:
return
# Freeze embeddings
if hasattr(self.vision_model, "embeddings"):
for param in self.vision_model.embeddings.parameters():
param.requires_grad = False
# Freeze specified number of encoder layers
if hasattr(self.vision_model, "encoder"):
for i, layer in enumerate(self.vision_model.encoder.layers):
if i < num_layers:
for param in layer.parameters():
param.requires_grad = False
def _apply_lora(
self,
r: int,
alpha: int,
dropout: float,
) -> None:
"""Apply LoRA adapters to attention layers."""
if not PEFT_AVAILABLE:
return
# Configure LoRA for vision transformer
lora_config = LoraConfig(
r=r,
lora_alpha=alpha,
lora_dropout=dropout,
target_modules=["q_proj", "v_proj"],
bias="none",
modules_to_save=[], # Don't save any full modules
)
self.vision_model = get_peft_model(self.vision_model, lora_config)
def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
"""
Extract normalized embeddings for logo images.
Args:
pixel_values: [batch, 3, 224, 224] preprocessed images
Returns:
embeddings: [batch, embedding_dim] L2-normalized
"""
# Get vision features
outputs = self.vision_model(pixel_values=pixel_values)
# Use pooler output (CLS token projection) if available
if hasattr(outputs, "pooler_output") and outputs.pooler_output is not None:
features = outputs.pooler_output
else:
# Fall back to CLS token from last hidden state
features = outputs.last_hidden_state[:, 0, :]
# Apply projection head
features = self.projection(features)
# L2 normalize for cosine similarity
features = F.normalize(features, dim=-1)
return features
def get_image_features(self, **kwargs) -> torch.Tensor:
"""
Compatibility method matching CLIP's interface.
Used by DetectLogosDETR._get_embedding_pil().
"""
return self.forward(kwargs["pixel_values"])
def get_trainable_parameters(self) -> List[torch.nn.Parameter]:
"""Return list of trainable parameters."""
return [p for p in self.parameters() if p.requires_grad]
def get_parameter_count(self) -> Dict[str, int]:
"""Return count of trainable and total parameters."""
total = sum(p.numel() for p in self.parameters())
trainable = sum(p.numel() for p in self.parameters() if p.requires_grad)
return {
"total": total,
"trainable": trainable,
"frozen": total - trainable,
"trainable_percent": 100 * trainable / total if total > 0 else 0,
}
def save_pretrained(self, output_dir: str) -> None:
"""
Save model in HuggingFace-compatible format.
Args:
output_dir: Directory to save model files
"""
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
# Save model weights
if self.peft_applied and PEFT_AVAILABLE:
# Save LoRA weights separately
self.vision_model.save_pretrained(output_path / "vision_lora")
# Save projection head
torch.save(
self.projection.state_dict(),
output_path / "projection_head.bin",
)
else:
# Save full model state
torch.save(self.state_dict(), output_path / "pytorch_model.bin")
# Save config
config = {
"model_type": "clip_logo_finetuned",
"embedding_dim": self.embedding_dim,
"lora_r": self.lora_r,
"lora_alpha": self.lora_alpha,
"freeze_layers": self.freeze_layers,
"add_projection_head": self.add_projection_head,
"peft_applied": self.peft_applied,
}
with open(output_path / "config.json", "w") as f:
json.dump(config, f, indent=2)
@classmethod
def from_pretrained(
cls,
model_path: str,
base_model: str = "openai/clip-vit-large-patch14",
device: Optional[torch.device] = None,
) -> "LogoFineTunedCLIP":
"""
Load a fine-tuned model from saved weights.
Args:
model_path: Path to saved model directory
base_model: Base CLIP model name (for architecture)
device: Device to load model on
Returns:
Loaded LogoFineTunedCLIP model
"""
model_path = Path(model_path)
# Load config
with open(model_path / "config.json", "r") as f:
config = json.load(f)
# Load base CLIP model
clip_model = CLIPModel.from_pretrained(base_model)
        # Create model instance.
        # Pass lora_r=0 so fresh, randomly initialized LoRA adapters are not
        # injected here; the saved adapters are attached below instead.
        model = cls(
            vision_model=clip_model.vision_model,
            lora_r=0,
            lora_alpha=config.get("lora_alpha", 1),
            freeze_layers=config.get("freeze_layers", 12),
            add_projection_head=config.get("add_projection_head", True),
            use_gradient_checkpointing=False,  # Not needed for inference
        )
        # Load weights
        if config.get("peft_applied", False) and PEFT_AVAILABLE:
            # Attach the saved LoRA adapters to the base vision model
            lora_path = model_path / "vision_lora"
            if lora_path.exists():
                model.vision_model = PeftModel.from_pretrained(
                    model.vision_model, lora_path
                )
                model.peft_applied = True
                model.lora_r = config.get("lora_r", 0)
                model.lora_alpha = config.get("lora_alpha", 1)
            # Load projection head
            proj_path = model_path / "projection_head.bin"
            if proj_path.exists():
                model.projection.load_state_dict(
                    torch.load(proj_path, map_location="cpu")
                )
else:
# Load full model state
weights_path = model_path / "pytorch_model.bin"
if weights_path.exists():
model.load_state_dict(torch.load(weights_path))
if device is not None:
model = model.to(device)
return model
def create_model(
base_model: str = "openai/clip-vit-large-patch14",
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.1,
freeze_layers: int = 12,
use_gradient_checkpointing: bool = True,
device: Optional[torch.device] = None,
) -> Tuple[LogoFineTunedCLIP, CLIPProcessor]:
"""
Create a fine-tunable CLIP model and processor.
Args:
base_model: HuggingFace model name or path
lora_r: LoRA rank (0 to disable)
lora_alpha: LoRA scaling factor
lora_dropout: LoRA dropout
freeze_layers: Number of layers to freeze
use_gradient_checkpointing: Enable gradient checkpointing
device: Device to load model on
Returns:
Tuple of (model, processor)
"""
# Load base CLIP model
clip_model = CLIPModel.from_pretrained(base_model)
processor = CLIPProcessor.from_pretrained(base_model)
# Create fine-tunable wrapper
model = LogoFineTunedCLIP(
vision_model=clip_model.vision_model,
lora_r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
freeze_layers=freeze_layers,
use_gradient_checkpointing=use_gradient_checkpointing,
)
if device is not None:
model = model.to(device)
# Print parameter info
param_info = model.get_parameter_count()
print(f"Model created:")
print(f" Total parameters: {param_info['total']:,}")
print(f" Trainable: {param_info['trainable']:,} ({param_info['trainable_percent']:.2f}%)")
print(f" Frozen: {param_info['frozen']:,}")
return model, processor
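# --- Usage sketch (illustrative, not part of the module) -------------------
# How a fine-tunable model might be created and used to embed one logo crop.
# The image path below is a placeholder; CLIPProcessor handles resizing and
# normalization, so the wrapper receives the expected pixel_values tensor.
if __name__ == "__main__":
    from PIL import Image

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, processor = create_model(lora_r=16, freeze_layers=12, device=device)
    model.eval()

    image = Image.open("reference_logos/some_brand/crop_0.png").convert("RGB")  # placeholder path
    inputs = processor(images=image, return_tensors="pt").to(device)

    with torch.no_grad():
        embedding = model.get_image_features(pixel_values=inputs["pixel_values"])

    # [1, embedding_dim], L2-normalized; ready for cosine-similarity matching
    print(embedding.shape, embedding.norm(dim=-1))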

training/trainer.py (new file, 405 lines)
@@ -0,0 +1,405 @@
"""
Training loop with checkpointing, mixed precision, and evaluation.
"""
import json
import logging
import time
from pathlib import Path
from typing import Dict, Optional, Tuple
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts, OneCycleLR
from torch.utils.data import DataLoader
from tqdm import tqdm
from .config import TrainingConfig
from .losses import get_loss_function
from .evaluation import EmbeddingEvaluator
# Check if amp is available
try:
from torch.cuda.amp import autocast, GradScaler
AMP_AVAILABLE = True
except ImportError:
AMP_AVAILABLE = False
autocast = None
GradScaler = None
class Trainer:
"""
Trainer for fine-tuning CLIP on logo recognition.
Features:
- Mixed precision training (FP16)
- Gradient accumulation
- Gradient checkpointing (via model)
- Cosine annealing LR scheduler
- Early stopping
- Checkpoint saving/loading
- Evaluation during training
"""
def __init__(
self,
model: nn.Module,
train_loader: DataLoader,
val_loader: DataLoader,
config: TrainingConfig,
logger: Optional[logging.Logger] = None,
):
"""
Initialize the trainer.
Args:
model: LogoFineTunedCLIP model
train_loader: Training dataloader
val_loader: Validation dataloader
config: Training configuration
logger: Optional logger instance
"""
self.model = model
self.train_loader = train_loader
self.val_loader = val_loader
self.config = config
self.logger = logger or logging.getLogger(__name__)
# Device setup
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu"
)
self.model.to(self.device)
self.logger.info(f"Using device: {self.device}")
# Optimizer - only trainable parameters
trainable_params = [p for p in model.parameters() if p.requires_grad]
self.logger.info(f"Trainable parameters: {sum(p.numel() for p in trainable_params):,}")
self.optimizer = AdamW(
trainable_params,
lr=config.learning_rate,
weight_decay=config.weight_decay,
)
        # Learning rate scheduler.
        # OneCycleLR is stepped once per optimizer update, so the schedule
        # length must account for gradient accumulation.
        steps_per_epoch = max(
            1, len(train_loader) // config.gradient_accumulation_steps
        )
        total_steps = steps_per_epoch * config.max_epochs
        self.scheduler = OneCycleLR(
            self.optimizer,
            max_lr=config.learning_rate,
            total_steps=total_steps,
            pct_start=config.warmup_steps / total_steps if total_steps > 0 else 0.1,
            anneal_strategy="cos",
        )
# Mixed precision training
self.use_amp = config.mixed_precision and AMP_AVAILABLE and self.device.type == "cuda"
if self.use_amp:
self.scaler = GradScaler()
self.logger.info("Mixed precision training enabled")
else:
self.scaler = None
if config.mixed_precision and not AMP_AVAILABLE:
self.logger.warning("Mixed precision requested but not available")
# Loss function
self.criterion = get_loss_function(
loss_type=config.loss_type,
temperature=config.temperature,
triplet_margin=config.triplet_margin,
)
# Evaluator
self.evaluator = EmbeddingEvaluator()
# Training state
self.epoch = 0
self.global_step = 0
self.best_val_loss = float("inf")
self.best_val_separation = float("-inf")
self.patience_counter = 0
self.training_history = []
def train(self) -> Dict[str, float]:
"""
Main training loop.
Returns:
Dict with final training metrics
"""
self.logger.info("Starting training...")
self.logger.info(f" Epochs: {self.config.max_epochs}")
self.logger.info(f" Batch size: {self.config.batch_size}")
self.logger.info(f" Gradient accumulation: {self.config.gradient_accumulation_steps}")
self.logger.info(f" Effective batch: {self.config.effective_batch_size}")
self.logger.info(f" Learning rate: {self.config.learning_rate}")
start_time = time.time()
for epoch in range(self.epoch, self.config.max_epochs):
self.epoch = epoch
self.logger.info(f"\nEpoch {epoch + 1}/{self.config.max_epochs}")
# Training epoch
train_metrics = self._train_epoch()
self.logger.info(
f"Train - Loss: {train_metrics['loss']:.4f}, "
f"LR: {train_metrics['lr']:.2e}"
)
# Validation
if (epoch + 1) % self.config.eval_every_n_epochs == 0:
val_metrics = self._validate()
self.logger.info(
f"Val - Loss: {val_metrics['loss']:.4f}, "
f"Pos Sim: {val_metrics['mean_pos_sim']:.3f}, "
f"Neg Sim: {val_metrics['mean_neg_sim']:.3f}, "
f"Separation: {val_metrics['separation']:.3f}"
)
# Record history
self.training_history.append({
"epoch": epoch + 1,
"train_loss": train_metrics["loss"],
"val_loss": val_metrics["loss"],
"val_separation": val_metrics["separation"],
"val_pos_sim": val_metrics["mean_pos_sim"],
"val_neg_sim": val_metrics["mean_neg_sim"],
})
# Checkpointing based on separation (primary) or loss (secondary)
improved = False
if val_metrics["separation"] > self.best_val_separation + self.config.min_delta:
self.best_val_separation = val_metrics["separation"]
improved = True
elif val_metrics["loss"] < self.best_val_loss - self.config.min_delta:
self.best_val_loss = val_metrics["loss"]
improved = True
if improved:
self.patience_counter = 0
self._save_checkpoint("best.pt")
self.logger.info("New best model saved!")
else:
self.patience_counter += 1
# Early stopping
if self.patience_counter >= self.config.patience:
self.logger.info(
f"Early stopping triggered at epoch {epoch + 1} "
f"(no improvement for {self.config.patience} epochs)"
)
break
# Periodic checkpoint
if (epoch + 1) % self.config.save_every_n_epochs == 0:
self._save_checkpoint(f"epoch_{epoch + 1}.pt")
# Training complete
total_time = time.time() - start_time
self.logger.info(f"\nTraining completed in {total_time / 60:.1f} minutes")
# Load best model
best_path = Path(self.config.checkpoint_dir) / "best.pt"
if best_path.exists():
self.load_checkpoint("best.pt")
self.logger.info("Loaded best model checkpoint")
return {
"best_val_loss": self.best_val_loss,
"best_val_separation": self.best_val_separation,
"total_epochs": self.epoch + 1,
"total_time_minutes": total_time / 60,
}
def _train_epoch(self) -> Dict[str, float]:
"""Run a single training epoch."""
self.model.train()
total_loss = 0.0
num_batches = 0
accumulation_steps = 0
progress_bar = tqdm(
self.train_loader,
desc=f"Epoch {self.epoch + 1}",
leave=False,
)
self.optimizer.zero_grad()
for batch_idx, (images, labels) in enumerate(progress_bar):
images = images.to(self.device)
labels = labels.to(self.device)
# Forward pass with mixed precision
if self.use_amp:
with autocast():
embeddings = self.model(images)
loss = self.criterion(embeddings, labels)
loss = loss / self.config.gradient_accumulation_steps
self.scaler.scale(loss).backward()
else:
embeddings = self.model(images)
loss = self.criterion(embeddings, labels)
loss = loss / self.config.gradient_accumulation_steps
loss.backward()
accumulation_steps += 1
# Optimizer step after accumulation
if accumulation_steps >= self.config.gradient_accumulation_steps:
if self.use_amp:
self.scaler.step(self.optimizer)
self.scaler.update()
else:
self.optimizer.step()
self.optimizer.zero_grad()
self.scheduler.step()
self.global_step += 1
accumulation_steps = 0
total_loss += loss.item() * self.config.gradient_accumulation_steps
num_batches += 1
# Update progress bar
progress_bar.set_postfix({
"loss": total_loss / num_batches,
"lr": self.scheduler.get_last_lr()[0],
})
# Logging
if (batch_idx + 1) % self.config.log_every_n_steps == 0:
self.logger.debug(
f"Step {self.global_step}: loss={total_loss / num_batches:.4f}"
)
return {
"loss": total_loss / max(num_batches, 1),
"lr": self.scheduler.get_last_lr()[0],
}
def _validate(self) -> Dict[str, float]:
"""Run validation and compute metrics."""
self.model.eval()
total_loss = 0.0
all_embeddings = []
all_labels = []
with torch.no_grad():
for images, labels in tqdm(self.val_loader, desc="Validating", leave=False):
images = images.to(self.device)
labels = labels.to(self.device)
if self.use_amp:
with autocast():
embeddings = self.model(images)
loss = self.criterion(embeddings, labels)
else:
embeddings = self.model(images)
loss = self.criterion(embeddings, labels)
total_loss += loss.item()
all_embeddings.append(embeddings.cpu())
all_labels.append(labels.cpu())
# Combine batches
all_embeddings = torch.cat(all_embeddings, dim=0)
all_labels = torch.cat(all_labels, dim=0)
# Compute embedding quality metrics
metrics = self.evaluator.compute_metrics(all_embeddings, all_labels)
metrics["loss"] = total_loss / max(len(self.val_loader), 1)
return metrics
def _save_checkpoint(self, filename: str) -> None:
"""Save training checkpoint."""
checkpoint_dir = Path(self.config.checkpoint_dir)
checkpoint_dir.mkdir(parents=True, exist_ok=True)
checkpoint = {
"epoch": self.epoch,
"global_step": self.global_step,
"model_state_dict": self.model.state_dict(),
"optimizer_state_dict": self.optimizer.state_dict(),
"scheduler_state_dict": self.scheduler.state_dict(),
"best_val_loss": self.best_val_loss,
"best_val_separation": self.best_val_separation,
"patience_counter": self.patience_counter,
"training_history": self.training_history,
"config": self.config.__dict__,
}
if self.scaler is not None:
checkpoint["scaler_state_dict"] = self.scaler.state_dict()
torch.save(checkpoint, checkpoint_dir / filename)
self.logger.debug(f"Saved checkpoint: {filename}")
def load_checkpoint(self, filename: str) -> None:
"""Load training checkpoint."""
checkpoint_path = Path(self.config.checkpoint_dir) / filename
if not checkpoint_path.exists():
self.logger.warning(f"Checkpoint not found: {checkpoint_path}")
return
checkpoint = torch.load(checkpoint_path, map_location=self.device)
self.model.load_state_dict(checkpoint["model_state_dict"])
self.optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
self.scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
self.epoch = checkpoint["epoch"]
self.global_step = checkpoint["global_step"]
self.best_val_loss = checkpoint["best_val_loss"]
self.best_val_separation = checkpoint.get("best_val_separation", float("-inf"))
self.patience_counter = checkpoint.get("patience_counter", 0)
self.training_history = checkpoint.get("training_history", [])
if self.scaler is not None and "scaler_state_dict" in checkpoint:
self.scaler.load_state_dict(checkpoint["scaler_state_dict"])
self.logger.info(f"Resumed from epoch {self.epoch + 1}")
def export_model(self, output_dir: Optional[str] = None) -> str:
"""
Export the trained model for inference.
Args:
output_dir: Output directory (uses config.output_dir if not specified)
Returns:
Path to exported model directory
"""
output_dir = output_dir or self.config.output_dir
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
# Save model
self.model.save_pretrained(output_dir)
# Save training config
config_path = output_path / "training_config.json"
with open(config_path, "w") as f:
json.dump(self.config.__dict__, f, indent=2)
# Save training history
history_path = output_path / "training_history.json"
with open(history_path, "w") as f:
json.dump(self.training_history, f, indent=2)
self.logger.info(f"Model exported to: {output_path}")
return str(output_path)
def get_training_summary(self) -> Dict:
"""Get summary of training."""
return {
"epochs_completed": self.epoch + 1,
"global_steps": self.global_step,
"best_val_loss": self.best_val_loss,
"best_val_separation": self.best_val_separation,
"history": self.training_history,
}
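# --- Usage sketch (illustrative, not part of the module) -------------------
# Rough outline of how the trainer is wired up; train_clip_logo.py is the real
# entry point. Assumptions: TrainingConfig() works with default values (real
# runs populate it from configs/*.yaml), and the toy TensorDataset below stands
# in for LogoContrastiveDataset, whose actual constructor lives in dataset.py.
if __name__ == "__main__":
    from torch.utils.data import TensorDataset

    from training.model import create_model  # assumes the repo root is on sys.path

    logging.basicConfig(level=logging.INFO)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, _ = create_model(lora_r=16, freeze_layers=12, device=device)

    # Toy data yielding (images, labels) batches, as _train_epoch/_validate expect
    images = torch.randn(32, 3, 224, 224)
    labels = torch.randint(0, 4, (32,))
    config = TrainingConfig()
    loader = DataLoader(TensorDataset(images, labels), batch_size=config.batch_size)

    trainer = Trainer(model, train_loader=loader, val_loader=loader, config=config)
    summary = trainer.train()
    trainer.export_model()
    print(summary)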