# CLIP Fine-Tuning for Logo Recognition
This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.
## Overview
The fine-tuning approach uses contrastive learning with LoRA (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing DetectLogosDETR class.
**Goal:** Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created

### Training Module (`training/`)

| File | Description |
|---|---|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |
### Scripts

| File | Description |
|---|---|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |
### Configuration

| File | Description |
|---|---|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
## Prerequisites

1. Install dependencies:

   ```bash
   uv sync
   ```

2. Prepare test data (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

   This creates:
   - `reference_logos/` - Cropped logo images organized by category/brand
   - `test_images/` - Full images for testing
   - `test_data_mapping.db` - SQLite database with mappings
## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --learning-rate 5e-6 \
    --max-epochs 30 \
    --batch-size 8
```

### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --resume checkpoints/epoch_10.pt
```
### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                       # LoRA rank (0 to disable)
lora_alpha: 32                   # LoRA scaling factor
freeze_layers: 12                # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32              # Different logos per batch
samples_per_logo: 4              # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8   # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07                # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
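These parameters are parsed into the `TrainingConfig` dataclass. A minimal sketch of what that loading looks like, assuming field names mirror the YAML keys above (`TrainingConfigSketch` and `load_config` are illustrative; the real class lives in `training/config.py` and may differ):

```python
# Illustrative sketch: parse configs/jetson_orin.yaml into a dataclass.
# Field names are taken from the YAML shown above; defaults match it too.
from dataclasses import dataclass

import yaml


@dataclass
class TrainingConfigSketch:
    base_model: str = "openai/clip-vit-large-patch14"
    lora_r: int = 16
    lora_alpha: int = 32
    freeze_layers: int = 12
    batch_size: int = 16
    logos_per_batch: int = 32
    samples_per_logo: int = 4
    gradient_accumulation_steps: int = 8
    learning_rate: float = 1.0e-5
    max_epochs: int = 20
    mixed_precision: bool = True
    temperature: float = 0.07
    patience: int = 5
    min_delta: float = 0.001


def load_config(path: str) -> TrainingConfigSketch:
    """Read YAML and overlay it on the dataclass defaults."""
    with open(path) as f:
        raw = yaml.safe_load(f) or {}
    return TrainingConfigSketch(**raw)
```

CLI overrides like `--learning-rate 5e-6` would then simply replace the corresponding field after loading.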
## Evaluation

### Test Fine-Tuned Model

**Important:** The fine-tuned model requires a higher threshold (0.82) than baseline (0.75).

```bash
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    -t 0.82 \
    --matching-method multi-ref \
    --seed 42
```

### Compare with Baseline

```bash
# Baseline CLIP (threshold 0.75)
uv run python test_logo_detection.py -n 50 \
    -e openai/clip-vit-large-patch14 \
    -t 0.75 \
    --matching-method multi-ref \
    --seed 42

# Fine-tuned model (threshold 0.82)
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    -t 0.82 \
    --matching-method multi-ref \
    --seed 42
```
## Threshold Selection

The fine-tuned model requires a higher similarity threshold than baseline CLIP. This is because contrastive learning successfully pushed non-matching logo similarities much lower, changing the score distribution.

### Similarity Distribution Analysis
| Metric | Baseline | Fine-tuned |
|---|---|---|
| Wrong logos mean similarity | 0.66 | 0.44 |
| Wrong logos above 0.75 | 23.2% | 0.6% |
| Correct logos mean similarity | 0.75 | 0.64 |
| Optimal threshold | 0.756 | 0.819 |
| F1 at optimal threshold | 67.1% | 71.9% |
**Key insight:** The fine-tuned model dramatically reduced similarities to wrong logos (mean 0.66 → 0.44). At threshold 0.75 it already rejects far more non-matches than baseline, but the remaining false positives cluster just above 0.75, so raising the threshold to 0.82 separates them from true matches.
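The optimal-threshold search amounts to sweeping a grid of thresholds and picking the one that maximizes F1 over known match/non-match scores. A self-contained sketch of that sweep (the score lists below are synthetic stand-ins loosely mimicking the fine-tuned distribution; the real analysis runs over test-set embeddings):

```python
# Sketch of an F1 threshold sweep over similarity scores.
def f1_at_threshold(pos, neg, t):
    """Scores >= t are predicted matches; `pos` holds true-match scores."""
    tp = sum(s >= t for s in pos)
    fp = sum(s >= t for s in neg)
    fn = len(pos) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pos, neg, lo=0.0, hi=1.0, steps=200):
    """Grid-search the threshold that maximizes F1."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: f1_at_threshold(pos, neg, t))


# Synthetic scores shaped like the fine-tuned model's distribution:
correct = [0.55, 0.60, 0.64, 0.68, 0.72, 0.80]  # same-logo similarities
wrong = [0.30, 0.40, 0.44, 0.48, 0.52]          # different-logo similarities
t_opt = best_threshold(correct, wrong)
```

With well-separated distributions like these, any threshold in the gap between the highest wrong score and the lowest correct score achieves perfect F1, which is why the fine-tuned model's lower wrong-logo similarities allow a cleaner operating point.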
### Analyze Similarity Distribution

To find the optimal threshold for your model:

```bash
# Run detailed similarity analysis
./analyze_similarity_distribution.sh --model finetuned

# Or analyze both models
./analyze_similarity_distribution.sh --model both
```

This outputs distribution statistics and suggests an optimal threshold based on the data.
## Expected Metrics
| Metric | Baseline (t=0.75) | Fine-tuned (t=0.82) |
|---|---|---|
| Precision | ~49% | >65% |
| Recall | ~77% | >70% |
| F1 Score | ~60% | >70% |
Training metrics to monitor:
- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
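The three monitored quantities can be computed directly from L2-normalized embeddings and brand labels. A numpy stand-in for what `EmbeddingEvaluator` reports (function name and exact formulation are illustrative, not the module's API):

```python
# Sketch: mean positive/negative similarity and separation from a batch
# of embeddings with per-sample brand labels.
import numpy as np


def embedding_metrics(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T                          # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]   # same-brand mask
    off_diag = ~np.eye(len(labels), dtype=bool)
    pos = sims[same & off_diag]                 # same-logo pairs (no self-pairs)
    neg = sims[~same]                           # different-logo pairs
    return {
        "mean_pos_sim": float(pos.mean()),
        "mean_neg_sim": float(neg.mean()),
        "separation": float(pos.mean() - neg.mean()),
    }
```

A healthy run drives `mean_pos_sim` up, `mean_neg_sim` down, and their difference (`separation`) past the 0.35 target.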
## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned \
    --merge-lora
```
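Merging works because the LoRA update is just a low-rank additive term: the adapter path `(alpha/r)·B·A` folds into the base weight, so merged inference needs no extra matmuls. A numpy sketch of the arithmetic (shapes are illustrative; in practice `peft` performs this per adapted layer):

```python
# Sketch of LoRA weight merging: W_merged = W + (alpha / r) * B @ A.
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 16, 32           # hidden dim, LoRA rank, scaling factor

W = rng.normal(size=(d, d))        # frozen base weight
A = rng.normal(size=(r, d))        # trained low-rank factors
B = rng.normal(size=(d, r))

W_merged = W + (alpha / r) * B @ A

# A forward pass through the merged weight equals base path + adapter path:
x = rng.normal(size=(d,))
y_unmerged = W @ x + (alpha / r) * B @ (A @ x)
y_merged = W_merged @ x
assert np.allclose(y_merged, y_unmerged)
```

This identity is why a merged export is a drop-in HuggingFace model with baseline inference cost.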
## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos.

2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.

3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.

4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
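The InfoNCE objective above can be sketched compactly: each anchor is scored against every positive in the batch, and cross-entropy pushes probability mass onto its own positive. A numpy stand-in (the real `InfoNCELoss` in `training/losses.py` is a torch module; this only illustrates the math):

```python
# Numpy sketch of InfoNCE: row i's positive is column i; every other
# sample in the batch acts as a negative.
import numpy as np


def info_nce(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.07) -> float:
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity logits
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # cross-entropy on diagonal
```

Lowering `temperature` sharpens the softmax, penalizing hard negatives more aggressively, which is why raising it (e.g. `--temperature 0.1`) is suggested in Troubleshooting when training stalls.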
### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128

This ensures positive pairs (same logo) exist within each batch for contrastive learning.
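The K × M sampling above can be sketched as follows (names are illustrative; the real logic lives in `LogoContrastiveDataset` in `training/dataset.py`):

```python
# Sketch of K-brands x M-samples batch construction for contrastive training.
import random


def make_batch(samples_by_brand: dict, k: int = 32, m: int = 4, rng=None):
    """Pick k brands, then m samples from each, so every brand in the
    batch contributes positive pairs; all cross-brand pairs are negatives."""
    rng = rng or random.Random()
    eligible = [b for b, s in samples_by_brand.items() if len(s) >= m]
    brands = rng.sample(eligible, k)
    batch = []
    for brand in brands:
        batch.extend((brand, s) for s in rng.sample(samples_by_brand[brand], m))
    return batch
```

With the defaults this yields 128 samples per batch, matching the `logos_per_batch: 32` and `samples_per_logo: 4` settings in the config.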
### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
## Troubleshooting

### Out of Memory

Reduce batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 8 \
    --gradient-accumulation-steps 16
```

### Slow Training

Ensure mixed precision is enabled (`mixed_precision: true` is the default in `configs/jetson_orin.yaml`):

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit config to use `loss_type: "combined"`

### Import Error for Fine-Tuned Model

Ensure the `training/` module is in your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```
## Dependencies Added

The following were added to `pyproject.toml`:

```
peft>=0.7.0          # LoRA support
pyyaml>=6.0          # Config file parsing
torchvision>=0.20.0  # Image transforms
```