# CLIP Fine-Tuning for Logo Recognition

This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.

## Overview

The fine-tuning approach uses **contrastive learning** with **LoRA** (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.

**Goal**: Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created

### Training Module (`training/`)

| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |

### Scripts

| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |

### Configuration

| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
## Prerequisites

1. **Install dependencies**:

   ```bash
   uv sync
   ```

2. **Prepare test data** (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

This creates:

- `reference_logos/` - Cropped logo images organized by category/brand
- `test_images/` - Full images for testing
- `test_data_mapping.db` - SQLite database with mappings
## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --learning-rate 5e-6 \
  --max-epochs 30 \
  --batch-size 8
```

### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --resume checkpoints/epoch_10.pt
```

### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                      # LoRA rank (0 to disable)
lora_alpha: 32                  # LoRA scaling factor
freeze_layers: 12               # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32             # Different logos per batch
samples_per_logo: 4             # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8  # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07               # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
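For orientation, here is a minimal sketch of how `lora_r`, `lora_alpha`, and `freeze_layers` might be wired with `peft` and a HuggingFace CLIP vision encoder. This is one plausible reading, not the actual implementation in `training/model.py`; the `target_modules` choice in particular is an assumption.

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# freeze_layers: 12 -- keep the first 12 of 24 transformer layers fixed
for layer in model.vision_model.encoder.layers[:12]:
    for param in layer.parameters():
        param.requires_grad = False

# lora_r / lora_alpha -- low-rank adapters on the attention projections
# (q_proj/v_proj as target modules is an assumption, not confirmed by the repo)
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # sanity-check how few weights actually train
```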
## Evaluation

### Test Fine-Tuned Model

**Important**: The fine-tuned model requires a higher threshold (0.82) than baseline (0.75).

```bash
uv run python test_logo_detection.py -n 50 \
  -e models/logo_detection/clip_finetuned \
  -t 0.82 \
  --matching-method multi-ref \
  --seed 42
```
### Compare with Baseline

```bash
# Baseline CLIP (threshold 0.75)
uv run python test_logo_detection.py -n 50 \
  -e openai/clip-vit-large-patch14 \
  -t 0.75 \
  --matching-method multi-ref \
  --seed 42

# Fine-tuned model (threshold 0.82)
uv run python test_logo_detection.py -n 50 \
  -e models/logo_detection/clip_finetuned \
  -t 0.82 \
  --matching-method multi-ref \
  --seed 42
```
### Threshold Selection

The fine-tuned model requires a **higher similarity threshold** than baseline CLIP. This is because contrastive learning successfully pushed non-matching logo similarities much lower, changing the score distribution.

#### Similarity Distribution Analysis

| Metric | Baseline | Fine-tuned |
|--------|----------|------------|
| Wrong logos mean similarity | 0.66 | **0.44** |
| Wrong logos above 0.75 | 23.2% | **0.6%** |
| Correct logos mean similarity | 0.75 | 0.64 |
| Optimal threshold | 0.756 | **0.819** |
| F1 at optimal threshold | 67.1% | **71.9%** |
**Key insight**: The fine-tuned model dramatically reduced similarities to wrong logos (mean 0.66 down to 0.44). At threshold 0.75 it already rejects far more non-matches than the baseline, but the few wrong-logo scores that remain bunch up just above 0.75, so a higher threshold (0.82) is needed to reject those residual false positives.
#### Analyze Similarity Distribution

To find the optimal threshold for your model:

```bash
# Run detailed similarity analysis
./analyze_similarity_distribution.sh --model finetuned

# Or analyze both models
./analyze_similarity_distribution.sh --model both
```

This outputs distribution statistics and suggests an optimal threshold based on the data.
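For intuition, here is a sketch of the kind of threshold sweep such an analysis performs. It assumes you have collected a similarity score and a correct/incorrect label for each candidate match (the hypothetical `scores` and `labels` arrays); the shell script above is the supported path.

```python
import numpy as np

def best_threshold(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Sweep candidate thresholds, return the (threshold, F1) pair maximizing F1.

    scores: cosine similarity for each candidate match.
    labels: 1 if the candidate was a correct match, 0 otherwise.
    """
    best_t, best_f1 = 0.0, 0.0
    for t in np.linspace(scores.min(), scores.max(), 200):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1
```

Applied to the distributions above, this kind of sweep is what yields the 0.756 (baseline) and 0.819 (fine-tuned) optima.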
### Expected Metrics

| Metric | Baseline (t=0.75) | Fine-tuned (t=0.82) |
|--------|-------------------|---------------------|
| Precision | ~49% | >65% |
| Recall | ~77% | >70% |
| F1 Score | ~60% | >70% |

Training metrics to monitor (a sketch for computing them follows this list):

- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
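These three numbers can be computed directly from a batch of L2-normalized validation embeddings and their logo labels. A sketch (illustrative; `EmbeddingEvaluator` in `training/evaluation.py` is the real implementation):

```python
import torch

def similarity_stats(embeddings: torch.Tensor, labels: torch.Tensor) -> dict:
    """embeddings: (N, D), assumed L2-normalized; labels: (N,) logo ids."""
    sim = embeddings @ embeddings.T                    # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-logo mask
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos = sim[same & ~eye].mean()                      # same logo, excluding self-pairs
    neg = sim[~same].mean()                            # different logos
    return {"positive": pos.item(), "negative": neg.item(),
            "separation": (pos - neg).item()}
```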
## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
  --checkpoint checkpoints/best.pt \
  --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
  --checkpoint checkpoints/best.pt \
  --output models/logo_detection/clip_finetuned \
  --merge-lora
```
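Merging folds the low-rank adapter matrices back into the base weights, so inference runs a plain CLIP checkpoint with no adapter indirection. With `peft` this is typically a one-liner; a sketch, assuming `export_model.py` wraps the encoder in a `PeftModel`:

```python
# Sketch: fold LoRA deltas into the base weights before saving.
merged = peft_model.merge_and_unload()  # returns the base model with adapters applied
merged.save_pretrained("models/logo_detection/clip_finetuned")
```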
## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos (a condensed sketch follows this list).

2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.

3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.

4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
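A condensed sketch of that InfoNCE objective (illustrative; `training/losses.py` holds the real `InfoNCELoss`):

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE: row i of `anchors` should match row i of `positives`;
    every other row in the batch serves as a negative."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # (N, N) similarity matrix
    targets = torch.arange(len(anchors), device=anchors.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```

Cross-entropy over the similarity matrix rewards each anchor for ranking its own positive above every other sample in the batch; the temperature sharpens or softens that ranking.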
### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128

This ensures positive pairs (same logo) exist within each batch for contrastive learning.
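A minimal sketch of such a sampler, assuming a dataset that exposes one logo label per index (hypothetical function; `training/dataset.py` implements the real batching):

```python
import random
from collections import defaultdict

def sample_batch(labels: list[int], logos_per_batch: int = 32,
                 samples_per_logo: int = 4) -> list[int]:
    """Pick K logos, then M sample indices for each, so every logo in the
    batch has in-batch positives for the contrastive loss."""
    by_logo = defaultdict(list)
    for idx, label in enumerate(labels):
        by_logo[label].append(idx)
    # Only logos with at least M samples can supply a full positive group.
    eligible = [l for l, idxs in by_logo.items() if len(idxs) >= samples_per_logo]
    batch = []
    for logo in random.sample(eligible, logos_per_batch):
        batch.extend(random.sample(by_logo[logo], samples_per_logo))
    return batch
```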
### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
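These map naturally onto standard `torchvision` transforms. A sketch of an equivalent pipeline; the magnitudes shown for jitter and affine are placeholders, and the exact values in `training/dataset.py` may differ:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomGrayscale(p=0.1),
])
```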
## Troubleshooting

### Out of Memory

Reduce batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --batch-size 8 \
  --gradient-accumulation-steps 16
```

### Slow Training

Ensure mixed precision is enabled:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is default in jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit config to use `loss_type: "combined"`

### Import Error for Fine-Tuned Model

Ensure the `training/` module is in your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```
## Dependencies Added

The following were added to the `dependencies` list in `pyproject.toml`:

```toml
dependencies = [
    "peft>=0.7.0",          # LoRA support
    "pyyaml>=6.0",          # Config file parsing
    "torchvision>=0.20.0",  # Image transforms
]
```