Add CLIP fine-tuning pipeline for logo recognition

Implement contrastive learning with LoRA to fine-tune CLIP's vision
encoder on LogoDet-3K dataset for improved logo embedding similarity.

New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics

New scripts:
- train_clip_logo.py: Main training entry point
- export_model.py: Export to HuggingFace-compatible format

Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs

Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates

Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies
# CLIP Fine-Tuning for Logo Recognition
This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.
## Overview
The fine-tuning approach uses **contrastive learning** with **LoRA** (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.
**Goal**: Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created
### Training Module (`training/`)
| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |
### Scripts
| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |
### Configuration
| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
| `configs/cloud_rtx4090.yaml` | Training config optimized for 24GB cloud GPUs |
| `configs/cloud_a100.yaml` | Training config optimized for 80GB cloud GPUs |
## Prerequisites
1. **Install dependencies**:
```bash
uv sync
```
2. **Prepare test data** (if not already done):
```bash
uv run python prepare_test_data.py
```
This creates:
- `reference_logos/` - Cropped logo images organized by category/brand
- `test_images/` - Full images for testing
- `test_data_mapping.db` - SQLite database with mappings
## Training
### Basic Training
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```
### Training with Overrides
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--learning-rate 5e-6 \
--max-epochs 30 \
--batch-size 8
```
### Resume from Checkpoint
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--resume checkpoints/epoch_10.pt
```
### Training Output
- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options
Key parameters in `configs/jetson_orin.yaml`:
```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16 # LoRA rank (0 to disable)
lora_alpha: 32 # LoRA scaling factor
freeze_layers: 12 # Freeze first N transformer layers
# Batch construction
batch_size: 16
logos_per_batch: 32 # Different logos per batch
samples_per_logo: 4 # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8 # Effective batch = 128
# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07 # InfoNCE temperature
# Early stopping
patience: 5
min_delta: 0.001
```
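For orientation, the LoRA keys map onto a PEFT `LoraConfig` roughly as follows (the `target_modules` choice and `lora_dropout` value are assumptions about the training code, not confirmed values):
```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPVisionModel

# Assumed translation of the YAML's LoRA keys into peft terms;
# target_modules and lora_dropout are illustrative guesses.
lora_config = LoraConfig(
    r=16,            # lora_r: rank of the low-rank update matrices
    lora_alpha=32,   # lora_alpha: scaling factor applied as alpha / r
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
model = get_peft_model(vision, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```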
## Evaluation
### Test Fine-Tuned Model
```bash
uv run python test_logo_detection.py -n 50 \
-e models/logo_detection/clip_finetuned \
--matching-method multi-ref \
--seed 42
```
### Compare with Baseline
```bash
# Baseline CLIP
uv run python test_logo_detection.py -n 50 \
-e openai/clip-vit-large-patch14 \
--matching-method multi-ref \
--seed 42
# Fine-tuned model
uv run python test_logo_detection.py -n 50 \
-e models/logo_detection/clip_finetuned \
--matching-method multi-ref \
--seed 42
```
### Expected Metrics
| Metric | Baseline CLIP | Target (Fine-tuned) |
|--------|---------------|---------------------|
| Precision | ~49% | >70% |
| Recall | ~77% | >75% |
| F1 Score | ~60% | >72% |
Training metrics to monitor:
- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
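A minimal sketch of how these three numbers fall out of L2-normalized embeddings and logo labels (the function name is illustrative, not the `EmbeddingEvaluator` API):
```python
import torch

def similarity_stats(embeddings: torch.Tensor, labels: torch.Tensor) -> tuple[float, float, float]:
    """Mean positive/negative cosine similarity and their separation.

    embeddings: (N, D), L2-normalized; labels: (N,) integer logo IDs.
    Illustrative sketch, not the actual EmbeddingEvaluator implementation.
    """
    sim = embeddings @ embeddings.T                     # cosine similarity matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-logo pairs
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    pos = sim[same & ~self_mask].mean()                 # same logo, self excluded
    neg = sim[~same].mean()                             # different logos
    return pos.item(), neg.item(), (pos - neg).item()   # separation = pos - neg
```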
## Export Model
To export a checkpoint to HuggingFace format:
```bash
uv run python export_model.py \
--checkpoint checkpoints/best.pt \
--output models/logo_detection/clip_finetuned
```
With LoRA weight merging (reduces inference overhead):
```bash
uv run python export_model.py \
--checkpoint checkpoints/best.pt \
--output models/logo_detection/clip_finetuned \
--merge-lora
```
## Using Fine-Tuned Model with DetectLogosDETR
The fine-tuned model works as a drop-in replacement:
```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details
### Training Approach
1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos (see the sketch after this list).
2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.
3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.
4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
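A minimal sketch of the InfoNCE objective from point 1, generalized to the multiple positives per anchor that the batch construction below provides (illustrative; `InfoNCELoss` in `training/losses.py` may differ in details):
```python
import torch
import torch.nn.functional as F

def info_nce(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Pull same-logo embeddings together, push different logos apart.

    embeddings: (N, D), L2-normalized; labels: (N,) integer logo IDs.
    Illustrative sketch, not the actual training/losses.py implementation.
    """
    logits = embeddings @ embeddings.T / temperature
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    logits = logits.masked_fill(self_mask, float("-inf"))  # never match a sample to itself
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = F.log_softmax(logits, dim=1)
    # Average log-probability over each anchor's positive pairs
    pos_log_prob = log_prob.masked_fill(~positives, 0.0)
    loss = -pos_log_prob.sum(dim=1) / positives.sum(dim=1).clamp(min=1)
    return loss.mean()
```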
### Batch Construction
Each batch contains:
- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128
This ensures positive pairs (same logo) exist within each batch for contrastive learning.
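A sketch of the sampling logic (illustrative; `LogoContrastiveDataset` may construct batches differently):
```python
import random

def sample_batch(samples_by_logo: dict[str, list[int]], k_logos: int = 32, m_samples: int = 4) -> list[int]:
    """Pick K logo brands, then M sample indices per brand (K * M = 128).

    samples_by_logo maps brand name -> list of dataset indices.
    Illustrative sketch of the batch construction described above.
    """
    eligible = [b for b, idxs in samples_by_logo.items() if len(idxs) >= m_samples]
    batch: list[int] = []
    for brand in random.sample(eligible, k_logos):
        batch.extend(random.sample(samples_by_logo[brand], m_samples))
    return batch  # every brand contributes M samples, so positives always exist
```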
### Data Augmentation
Medium strength augmentations:
- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
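In torchvision terms, this stack looks roughly like the following (the ±15° rotation and 10% grayscale come from the list above; the remaining magnitudes are assumptions):
```python
from torchvision import transforms

# Approximate "medium" augmentation stack; jitter and affine magnitudes
# are assumed values, not the exact ones used in training/dataset.py.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),                    # ±15°
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomGrayscale(p=0.1),                        # 10% of images
])
```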
## Troubleshooting
### Out of Memory
Reduce batch size and increase gradient accumulation:
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
--batch-size 8 \
--gradient-accumulation-steps 16
```
### Slow Training
Ensure mixed precision is enabled:
```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is default in jetson_orin.yaml
```
### No Improvement
Try adjusting:
- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit config to use `loss_type: "combined"`
### Import Error for Fine-Tuned Model
Ensure the `training/` module is in your Python path:
```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```
## Dependencies Added
The following were added to `pyproject.toml`:
```toml
peft>=0.7.0 # LoRA support
pyyaml>=6.0 # Config file parsing
torchvision>=0.20.0 # Image transforms
```