Add CLIP fine-tuning pipeline for logo recognition
Implement contrastive learning with LoRA to fine-tune CLIP's vision encoder on the LogoDet-3K dataset for improved logo embedding similarity.

New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics

New scripts:
- train_clip_logo.py: Main training entry point
- export_model.py: Export to HuggingFace-compatible format

Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs

Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates

Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies
# CLIP Fine-Tuning for Logo Recognition

This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.

## Overview

The fine-tuning approach uses **contrastive learning** with **LoRA** (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.

**Goal**: Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created

### Training Module (`training/`)

| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |
### Scripts

| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |
### Configuration

| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
| `configs/cloud_rtx4090.yaml` | Training config optimized for 24GB cloud GPUs |
| `configs/cloud_a100.yaml` | Training config optimized for 80GB cloud GPUs |
## Prerequisites

1. **Install dependencies**:

   ```bash
   uv sync
   ```

2. **Prepare test data** (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

   This creates:
   - `reference_logos/` - Cropped logo images organized by category/brand
   - `test_images/` - Full images for testing
   - `test_data_mapping.db` - SQLite database with mappings
## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --learning-rate 5e-6 \
    --max-epochs 30 \
    --batch-size 8
```

### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --resume checkpoints/epoch_10.pt
```

### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                       # LoRA rank (0 to disable)
lora_alpha: 32                   # LoRA scaling factor
freeze_layers: 12                # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32              # Different logos per batch
samples_per_logo: 4              # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8   # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07                # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
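For orientation, here is roughly how `lora_r` and `lora_alpha` map onto a peft `LoraConfig`. This is a minimal sketch, not the exact code in `training/model.py`; the target module names assume the Hugging Face CLIP implementation:

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPVisionModel

base = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# r and lora_alpha come straight from the YAML config; the attention
# projections are the usual LoRA targets in HF CLIP attention blocks.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable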
## Evaluation

### Test Fine-Tuned Model

```bash
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    --matching-method multi-ref \
    --seed 42
```

### Compare with Baseline

```bash
# Baseline CLIP
uv run python test_logo_detection.py -n 50 \
    -e openai/clip-vit-large-patch14 \
    --matching-method multi-ref \
    --seed 42

# Fine-tuned model
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    --matching-method multi-ref \
    --seed 42
```
### Expected Metrics

| Metric | Baseline CLIP | Target (Fine-tuned) |
|--------|---------------|---------------------|
| Precision | ~49% | >70% |
| Recall | ~77% | >75% |
| F1 Score | ~60% | >72% |

Training metrics to monitor:

- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation (positive minus negative): target > 0.35
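A minimal sketch of how these statistics can be computed, assuming L2-normalized embeddings and integer brand labels (the actual `EmbeddingEvaluator` interface may differ):

```python
import torch

def similarity_stats(embeddings: torch.Tensor, labels: torch.Tensor) -> dict:
    """Mean positive/negative cosine similarity over all pairs in a batch.

    embeddings: (N, D), assumed L2-normalized; labels: (N,) brand ids.
    """
    sims = embeddings @ embeddings.T                       # (N, N) cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)      # same-brand mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
    pos = sims[same & ~eye].mean()                         # exclude self-pairs
    neg = sims[~same].mean()
    return {"pos_sim": pos.item(),
            "neg_sim": neg.item(),
            "separation": (pos - neg).item()}
```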
## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned \
    --merge-lora
```
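Merging folds the low-rank adapters into the base weights so inference needs no peft at runtime. A sketch of the idea using peft directly; the adapter path is hypothetical and `export_model.py` may structure this differently:

```python
from peft import PeftModel
from transformers import CLIPVisionModel

# Load the base model, attach the trained LoRA adapters, then fold the
# adapters into the base weights and save a plain HF model.
base = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
model = PeftModel.from_pretrained(base, "checkpoints/lora_adapter")  # hypothetical path
merged = model.merge_and_unload()  # plain CLIPVisionModel with LoRA folded in
merged.save_pretrained("models/logo_detection/clip_finetuned")
```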
## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos (see the sketch after this list).

2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.

3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.

4. **Logo-Level Splits**: Splits data by logo brand (not by image) so validation tests generalization to unseen logos.
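As referenced above, a minimal sketch of a supervised InfoNCE loss with in-batch positives, assuming L2-normalized embeddings and integer brand labels; the implementation in `losses.py` may differ in details:

```python
import torch

def info_nce(embeddings: torch.Tensor, labels: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Supervised InfoNCE: same-label pairs are positives, rest negatives.

    embeddings: (N, D), L2-normalized; labels: (N,) integer brand ids.
    """
    logits = embeddings @ embeddings.T / temperature       # (N, N) scaled similarities
    n = logits.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=logits.device)
    logits = logits.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    masked = log_prob.masked_fill(~pos_mask, 0.0)          # keep only positive pairs
    # mean log-likelihood of each anchor's positives, averaged over anchors
    loss = -masked.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```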
### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128 (the effective batch size after gradient accumulation)

This ensures positive pairs (same logo) exist within each batch for contrastive learning.
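A sketch of the sampling idea; the names and details are illustrative, not necessarily how `LogoContrastiveDataset` implements it:

```python
import random

def sample_batch(indices_by_brand: dict[str, list[int]],
                 logos_per_batch: int = 32,
                 samples_per_logo: int = 4) -> list[int]:
    """Pick K brands, then M dataset indices per brand (K * M total)."""
    eligible = [b for b, idx in indices_by_brand.items()
                if len(idx) >= samples_per_logo]
    brands = random.sample(eligible, logos_per_batch)
    batch = []
    for brand in brands:
        batch += random.sample(indices_by_brand[brand], samples_per_logo)
    return batch  # 32 * 4 = 128 indices with guaranteed positive pairs
```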
### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
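These roughly correspond to a torchvision pipeline like the following; the parameter values are illustrative, and the exact strengths live in `dataset.py`:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomGrayscale(p=0.1),
])
```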
## Troubleshooting

### Out of Memory

Reduce the batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 8 \
    --gradient-accumulation-steps 16
```

### Slow Training

Ensure mixed precision is enabled:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is the default in jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit the config to use `loss_type: "combined"` (a possible shape is sketched below)
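For reference, a plausible shape for the combined loss, assuming it is a weighted sum of the individual losses; the weights and constructor are illustrative, not the actual `losses.py` API:

```python
import torch.nn as nn

class CombinedLoss(nn.Module):
    """Weighted sum of contrastive losses; weights are illustrative."""

    def __init__(self, infonce: nn.Module, triplet: nn.Module,
                 w_infonce: float = 1.0, w_triplet: float = 0.5):
        super().__init__()
        self.infonce, self.triplet = infonce, triplet
        self.w_infonce, self.w_triplet = w_infonce, w_triplet

    def forward(self, embeddings, labels):
        # Both losses consume the same (embeddings, labels) batch.
        return (self.w_infonce * self.infonce(embeddings, labels)
                + self.w_triplet * self.triplet(embeddings, labels))
```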
### Import Error for Fine-Tuned Model

Ensure the `training/` module is in your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```

## Dependencies Added

The following were added to `pyproject.toml`:

```toml
peft>=0.7.0          # LoRA support
pyyaml>=6.0          # Config file parsing
torchvision>=0.20.0  # Image transforms
```