# CLIP Fine-Tuning for Logo Recognition

This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.

## Overview

The fine-tuning approach uses **contrastive learning** with **LoRA** (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.

**Goal**: Improve F1 from ~60% to >72% on logo matching tasks.

## Files Created

### Training Module (`training/`)

| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |

### Scripts

| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |

### Configuration

| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |

## Prerequisites

1. **Install dependencies**:

   ```bash
   uv sync
   ```

2.
**Prepare test data** (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

   This creates:

   - `reference_logos/` - Cropped logo images organized by category/brand
   - `test_images/` - Full images for testing
   - `test_data_mapping.db` - SQLite database with mappings

## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --learning-rate 5e-6 \
    --max-epochs 30 \
    --batch-size 8
```

### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --resume checkpoints/epoch_10.pt
```

### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`

## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                       # LoRA rank (0 to disable)
lora_alpha: 32                   # LoRA scaling factor
freeze_layers: 12                # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32              # Different logos per batch
samples_per_logo: 4              # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8   # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07                # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```

## Evaluation

### Test Fine-Tuned Model

```bash
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    --matching-method multi-ref \
    --seed 42
```

### Compare with Baseline

```bash
# Baseline CLIP
uv run python test_logo_detection.py -n 50 \
    -e openai/clip-vit-large-patch14 \
    --matching-method multi-ref \
    --seed 42

# Fine-tuned model
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    --matching-method multi-ref \
    --seed 42
```
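The `--matching-method multi-ref` flag above matches each detected logo embedding against several reference crops per brand rather than a single prototype. A minimal NumPy sketch of that idea (the function name and threshold below are illustrative assumptions, not code from this repository):

```python
import numpy as np

def match_multi_ref(query, refs, threshold=0.75):
    """Match one query embedding against multiple references per brand.

    query: L2-normalized embedding, shape (D,)
    refs:  dict mapping brand -> array of L2-normalized embeddings, shape (N_i, D)
    Returns (brand, score), with brand=None when the best score is below threshold.
    """
    best_brand, best_score = None, -1.0
    for brand, emb in refs.items():
        # For unit vectors, cosine similarity is just a dot product;
        # score a brand by its best-matching reference crop.
        score = float((emb @ query).max())
        if score > best_score:
            best_brand, best_score = brand, score
    return (best_brand if best_score >= threshold else None), best_score
```

Taking the max over references makes a brand match as long as *any* of its reference crops is close, which is more forgiving of logo variants than averaging.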
### Expected Metrics

| Metric | Baseline CLIP | Target (Fine-tuned) |
|--------|---------------|---------------------|
| Precision | ~49% | >70% |
| Recall | ~77% | >75% |
| F1 Score | ~60% | >72% |

Training metrics to monitor:

- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35

## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned \
    --merge-lora
```

## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```

## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos.
2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and helps prevent catastrophic forgetting.
3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.
4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
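The InfoNCE objective from step 1 is compact enough to sketch directly. Below is a simplified, unbatched NumPy illustration (not the `InfoNCELoss` implementation from `losses.py`), using the same `temperature` as the config: each anchor's positive sits on the diagonal of the similarity matrix, and the loss is cross-entropy over each row.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE over N (anchor, positive) pairs of L2-normalized embeddings.

    logits[i, j] = cos(anchor_i, positive_j) / T; row i's correct "class" is j = i.
    """
    logits = anchors @ positives.T / temperature           # (N, N) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # cross-entropy on the diagonal
```

The loss approaches zero when each anchor is far more similar to its own positive than to every other sample in the batch, which is exactly the "positive similarity up, negative similarity down" behavior the monitoring metrics above track.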
### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128

This ensures positive pairs (same logo) exist within each batch for contrastive learning.

### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)

## Troubleshooting

### Out of Memory

Reduce the batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 8 \
    --gradient-accumulation-steps 16
```

### Slow Training

Ensure mixed precision is enabled:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is the default in jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit the config to use `loss_type: "combined"`

### Import Error for Fine-Tuned Model

Ensure the `training/` module is on your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```

## Dependencies Added

The following were added to `pyproject.toml`:

```toml
peft>=0.7.0          # LoRA support
pyyaml>=6.0          # Config file parsing
torchvision>=0.20.0  # Image transforms
```
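For reference, the K-brands × M-samples batch construction described under Architecture Details can be sketched in plain Python. `build_batch` is a hypothetical illustration, not the actual sampler in `dataset.py`:

```python
import random
from collections import defaultdict

def build_batch(labels, logos_per_batch=32, samples_per_logo=4, rng=None):
    """Pick K brands, then M distinct sample indices per brand, so every
    batch is guaranteed to contain in-batch positive pairs.

    labels: list of brand names, one per dataset index.
    Returns a list of up to K * M dataset indices.
    """
    rng = rng or random.Random()
    by_brand = defaultdict(list)
    for idx, brand in enumerate(labels):
        by_brand[brand].append(idx)
    # Only brands with at least M samples can contribute positive pairs.
    eligible = [b for b, idxs in by_brand.items() if len(idxs) >= samples_per_logo]
    batch = []
    for brand in rng.sample(eligible, min(logos_per_batch, len(eligible))):
        batch.extend(rng.sample(by_brand[brand], samples_per_logo))
    return batch
```

With the defaults above this yields 32 × 4 = 128 samples per batch, matching the `logos_per_batch` and `samples_per_logo` values in `configs/jetson_orin.yaml`.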