# CLIP Fine-Tuning for Logo Recognition

This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.

## Overview

The fine-tuning approach uses **contrastive learning** with **LoRA** (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.

**Goal**: Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created

### Training Module (`training/`)

| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |

### Scripts

| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |

### Configuration

| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
## Prerequisites

1. **Install dependencies**:

   ```bash
   uv sync
   ```

2. **Prepare test data** (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

This creates:

- `reference_logos/` - Cropped logo images organized by category/brand
- `test_images/` - Full images for testing
- `test_data_mapping.db` - SQLite database with mappings
## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --learning-rate 5e-6 \
  --max-epochs 30 \
  --batch-size 8
```

### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --resume checkpoints/epoch_10.pt
```

### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                      # LoRA rank (0 to disable)
lora_alpha: 32                  # LoRA scaling factor
freeze_layers: 12               # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32             # Different logos per batch
samples_per_logo: 4             # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8  # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07               # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
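For orientation, here is a minimal sketch of how `lora_r`, `lora_alpha`, and `freeze_layers` might be wired with `peft` and a HuggingFace CLIP vision encoder. This is one plausible reading, not the actual implementation in `training/model.py`; the `target_modules` choice in particular is an assumption.

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# freeze_layers: 12 -- keep the first 12 of 24 transformer layers fixed
for layer in model.vision_model.encoder.layers[:12]:
    for param in layer.parameters():
        param.requires_grad = False

# lora_r / lora_alpha -- low-rank adapters on the attention projections
# (q_proj/v_proj as target modules is an assumption, not confirmed by the repo)
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # sanity-check how few weights actually train
```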
## Evaluation

### Test Fine-Tuned Model

**Important**: The fine-tuned model requires a higher threshold (0.82) than baseline (0.75).

```bash
uv run python test_logo_detection.py -n 50 \
  -e models/logo_detection/clip_finetuned \
  -t 0.82 \
  --matching-method multi-ref \
  --seed 42
```
### Compare with Baseline

```bash
# Baseline CLIP (threshold 0.75)
uv run python test_logo_detection.py -n 50 \
  -e openai/clip-vit-large-patch14 \
  -t 0.75 \
  --matching-method multi-ref \
  --seed 42

# Fine-tuned model (threshold 0.82)
uv run python test_logo_detection.py -n 50 \
  -e models/logo_detection/clip_finetuned \
  -t 0.82 \
  --matching-method multi-ref \
  --seed 42
```
### Threshold Selection

The fine-tuned model requires a **higher similarity threshold** than baseline CLIP. This is because contrastive learning successfully pushed non-matching logo similarities much lower, changing the score distribution.

#### Similarity Distribution Analysis

| Metric | Baseline | Fine-tuned |
|--------|----------|------------|
| Wrong logos mean similarity | 0.66 | **0.44** |
| Wrong logos above 0.75 | 23.2% | **0.6%** |
| Correct logos mean similarity | 0.75 | 0.64 |
| Optimal threshold | 0.756 | **0.819** |
| F1 at optimal threshold | 67.1% | **71.9%** |
**Key insight**: The fine-tuned model dramatically reduced similarities to wrong logos (mean 0.66 down to 0.44). At threshold 0.75 it already rejects far more non-matches than the baseline, but the few wrong-logo scores that remain bunch up just above 0.75, so a higher threshold (0.82) is needed to reject those residual false positives.
#### Analyze Similarity Distribution

To find the optimal threshold for your model:

```bash
# Run detailed similarity analysis
./analyze_similarity_distribution.sh --model finetuned

# Or analyze both models
./analyze_similarity_distribution.sh --model both
```

This outputs distribution statistics and suggests an optimal threshold based on the data.
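For intuition, here is a sketch of the kind of threshold sweep such an analysis performs. It assumes you have collected a similarity score and a correct/incorrect label for each candidate match (the hypothetical `scores` and `labels` arrays); the shell script above is the supported path.

```python
import numpy as np

def best_threshold(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Sweep candidate thresholds, return the (threshold, F1) pair maximizing F1.

    scores: cosine similarity for each candidate match.
    labels: 1 if the candidate was a correct match, 0 otherwise.
    """
    best_t, best_f1 = 0.0, 0.0
    for t in np.linspace(scores.min(), scores.max(), 200):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1
```

Applied to the distributions above, this kind of sweep is what yields the 0.756 (baseline) and 0.819 (fine-tuned) optima.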
### Expected Metrics

| Metric | Baseline (t=0.75) | Fine-tuned (t=0.82) |
|--------|-------------------|---------------------|
| Precision | ~49% | >65% |
| Recall | ~77% | >70% |
| F1 Score | ~60% | >70% |

Training metrics to monitor (a sketch for computing them follows this list):

- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
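These three numbers can be computed directly from a batch of L2-normalized validation embeddings and their logo labels. A sketch (illustrative; `EmbeddingEvaluator` in `training/evaluation.py` is the real implementation):

```python
import torch

def similarity_stats(embeddings: torch.Tensor, labels: torch.Tensor) -> dict:
    """embeddings: (N, D), assumed L2-normalized; labels: (N,) logo ids."""
    sim = embeddings @ embeddings.T                    # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-logo mask
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos = sim[same & ~eye].mean()                      # same logo, excluding self-pairs
    neg = sim[~same].mean()                            # different logos
    return {"positive": pos.item(), "negative": neg.item(),
            "separation": (pos - neg).item()}
```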
## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
  --checkpoint checkpoints/best.pt \
  --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
  --checkpoint checkpoints/best.pt \
  --output models/logo_detection/clip_finetuned \
  --merge-lora
```
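Merging folds the low-rank adapter matrices back into the base weights, so inference runs a plain CLIP checkpoint with no adapter indirection. With `peft` this is typically a one-liner; a sketch, assuming `export_model.py` wraps the encoder in a `PeftModel`:

```python
# Sketch: fold LoRA deltas into the base weights before saving.
merged = peft_model.merge_and_unload()  # returns the base model with adapters applied
merged.save_pretrained("models/logo_detection/clip_finetuned")
```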
## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos (a condensed sketch follows this list).

2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.

3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.

4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
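A condensed sketch of that InfoNCE objective (illustrative; `training/losses.py` holds the real `InfoNCELoss`):

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE: row i of `anchors` should match row i of `positives`;
    every other row in the batch serves as a negative."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # (N, N) similarity matrix
    targets = torch.arange(len(anchors), device=anchors.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```

Cross-entropy over the similarity matrix rewards each anchor for ranking its own positive above every other sample in the batch; the temperature sharpens or softens that ranking.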
### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128

This ensures positive pairs (same logo) exist within each batch for contrastive learning.
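A minimal sketch of such a sampler, assuming a dataset that exposes one logo label per index (hypothetical function; `training/dataset.py` implements the real batching):

```python
import random
from collections import defaultdict

def sample_batch(labels: list[int], logos_per_batch: int = 32,
                 samples_per_logo: int = 4) -> list[int]:
    """Pick K logos, then M sample indices for each, so every logo in the
    batch has in-batch positives for the contrastive loss."""
    by_logo = defaultdict(list)
    for idx, label in enumerate(labels):
        by_logo[label].append(idx)
    # Only logos with at least M samples can supply a full positive group.
    eligible = [l for l, idxs in by_logo.items() if len(idxs) >= samples_per_logo]
    batch = []
    for logo in random.sample(eligible, logos_per_batch):
        batch.extend(random.sample(by_logo[logo], samples_per_logo))
    return batch
```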
### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
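These map naturally onto standard `torchvision` transforms. A sketch of an equivalent pipeline; the magnitudes shown for jitter and affine are placeholders, and the exact values in `training/dataset.py` may differ:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomGrayscale(p=0.1),
])
```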
## Troubleshooting

### Out of Memory

Reduce batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --batch-size 8 \
  --gradient-accumulation-steps 16
```

### Slow Training

Ensure mixed precision is enabled:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is default in jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit config to use `loss_type: "combined"`

### Import Error for Fine-Tuned Model

Ensure the `training/` module is in your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```
## Dependencies Added

The following were added to the `dependencies` list in `pyproject.toml`:

```toml
dependencies = [
    "peft>=0.7.0",          # LoRA support
    "pyyaml>=6.0",          # Config file parsing
    "torchvision>=0.20.0",  # Image transforms
]
```