# CLIP Fine-Tuning for Logo Recognition
This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.
## Overview
The fine-tuning approach uses contrastive learning with LoRA (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing DetectLogosDETR class.
**Goal:** Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created

### Training Module (`training/`)

| File | Description |
|---|---|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |
### Scripts

| File | Description |
|---|---|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |
### Configuration

| File | Description |
|---|---|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
## Prerequisites

1. Install dependencies:

   ```bash
   uv sync
   ```

2. Prepare test data (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

   This creates:
   - `reference_logos/` - Cropped logo images organized by category/brand
   - `test_images/` - Full images for testing
   - `test_data_mapping.db` - SQLite database with mappings
## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --learning-rate 5e-6 \
    --max-epochs 30 \
    --batch-size 8
```

### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --resume checkpoints/epoch_10.pt
```
### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                       # LoRA rank (0 to disable)
lora_alpha: 32                   # LoRA scaling factor
freeze_layers: 12                # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32              # Different logos per batch
samples_per_logo: 4              # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8   # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07                # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
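These parameters are parsed into the `TrainingConfig` dataclass. A minimal sketch of what that loading looks like, assuming field names mirror the YAML keys above (`TrainingConfigSketch` and `load_config` are illustrative; the real class lives in `training/config.py` and may differ):

```python
# Illustrative sketch: parse configs/jetson_orin.yaml into a dataclass.
# Field names are taken from the YAML shown above; defaults match it too.
from dataclasses import dataclass

import yaml


@dataclass
class TrainingConfigSketch:
    base_model: str = "openai/clip-vit-large-patch14"
    lora_r: int = 16
    lora_alpha: int = 32
    freeze_layers: int = 12
    batch_size: int = 16
    logos_per_batch: int = 32
    samples_per_logo: int = 4
    gradient_accumulation_steps: int = 8
    learning_rate: float = 1.0e-5
    max_epochs: int = 20
    mixed_precision: bool = True
    temperature: float = 0.07
    patience: int = 5
    min_delta: float = 0.001


def load_config(path: str) -> TrainingConfigSketch:
    """Read YAML and overlay it on the dataclass defaults."""
    with open(path) as f:
        raw = yaml.safe_load(f) or {}
    return TrainingConfigSketch(**raw)
```

CLI overrides like `--learning-rate 5e-6` would then simply replace the corresponding field after loading.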
## Evaluation

### Test Fine-Tuned Model

**Important:** The fine-tuned model requires a higher threshold (0.82) than baseline (0.75).

```bash
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    -t 0.82 \
    --matching-method multi-ref \
    --seed 42
```

### Compare with Baseline

```bash
# Baseline CLIP (threshold 0.75)
uv run python test_logo_detection.py -n 50 \
    -e openai/clip-vit-large-patch14 \
    -t 0.75 \
    --matching-method multi-ref \
    --seed 42

# Fine-tuned model (threshold 0.82)
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    -t 0.82 \
    --matching-method multi-ref \
    --seed 42
```
## Threshold Selection

The fine-tuned model requires a higher similarity threshold than baseline CLIP. This is because contrastive learning successfully pushed non-matching logo similarities much lower, changing the score distribution.

### Similarity Distribution Analysis
| Metric | Baseline | Fine-tuned |
|---|---|---|
| Wrong logos mean similarity | 0.66 | 0.44 |
| Wrong logos above 0.75 | 23.2% | 0.6% |
| Correct logos mean similarity | 0.75 | 0.64 |
| Optimal threshold | 0.756 | 0.819 |
| F1 at optimal threshold | 67.1% | 71.9% |
**Key insight:** The fine-tuned model dramatically reduced similarities to wrong logos (mean 0.66 → 0.44). At threshold 0.75 it already rejects far more non-matches than baseline, but the remaining false positives cluster just above 0.75, so raising the threshold to 0.82 separates them from true matches.
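The optimal-threshold search amounts to sweeping a grid of thresholds and picking the one that maximizes F1 over known match/non-match scores. A self-contained sketch of that sweep (the score lists below are synthetic stand-ins loosely mimicking the fine-tuned distribution; the real analysis runs over test-set embeddings):

```python
# Sketch of an F1 threshold sweep over similarity scores.
def f1_at_threshold(pos, neg, t):
    """Scores >= t are predicted matches; `pos` holds true-match scores."""
    tp = sum(s >= t for s in pos)
    fp = sum(s >= t for s in neg)
    fn = len(pos) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pos, neg, lo=0.0, hi=1.0, steps=200):
    """Grid-search the threshold that maximizes F1."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: f1_at_threshold(pos, neg, t))


# Synthetic scores shaped like the fine-tuned model's distribution:
correct = [0.55, 0.60, 0.64, 0.68, 0.72, 0.80]  # same-logo similarities
wrong = [0.30, 0.40, 0.44, 0.48, 0.52]          # different-logo similarities
t_opt = best_threshold(correct, wrong)
```

With well-separated distributions like these, any threshold in the gap between the highest wrong score and the lowest correct score achieves perfect F1, which is why the fine-tuned model's lower wrong-logo similarities allow a cleaner operating point.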
### Analyze Similarity Distribution

To find the optimal threshold for your model:

```bash
# Run detailed similarity analysis
./analyze_similarity_distribution.sh --model finetuned

# Or analyze both models
./analyze_similarity_distribution.sh --model both
```

This outputs distribution statistics and suggests an optimal threshold based on the data.
## Expected Metrics
| Metric | Baseline (t=0.75) | Fine-tuned (t=0.82) |
|---|---|---|
| Precision | ~49% | >65% |
| Recall | ~77% | >70% |
| F1 Score | ~60% | >70% |
Training metrics to monitor:
- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
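The three monitored quantities can be computed directly from L2-normalized embeddings and brand labels. A numpy stand-in for what `EmbeddingEvaluator` reports (function name and exact formulation are illustrative, not the module's API):

```python
# Sketch: mean positive/negative similarity and separation from a batch
# of embeddings with per-sample brand labels.
import numpy as np


def embedding_metrics(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T                          # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]   # same-brand mask
    off_diag = ~np.eye(len(labels), dtype=bool)
    pos = sims[same & off_diag]                 # same-logo pairs (no self-pairs)
    neg = sims[~same]                           # different-logo pairs
    return {
        "mean_pos_sim": float(pos.mean()),
        "mean_neg_sim": float(neg.mean()),
        "separation": float(pos.mean() - neg.mean()),
    }
```

A healthy run drives `mean_pos_sim` up, `mean_neg_sim` down, and their difference (`separation`) past the 0.35 target.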
## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned \
    --merge-lora
```
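Merging works because the LoRA update is just a low-rank additive term: the adapter path `(alpha/r)·B·A` folds into the base weight, so merged inference needs no extra matmuls. A numpy sketch of the arithmetic (shapes are illustrative; in practice `peft` performs this per adapted layer):

```python
# Sketch of LoRA weight merging: W_merged = W + (alpha / r) * B @ A.
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 16, 32           # hidden dim, LoRA rank, scaling factor

W = rng.normal(size=(d, d))        # frozen base weight
A = rng.normal(size=(r, d))        # trained low-rank factors
B = rng.normal(size=(d, r))

W_merged = W + (alpha / r) * B @ A

# A forward pass through the merged weight equals base path + adapter path:
x = rng.normal(size=(d,))
y_unmerged = W @ x + (alpha / r) * B @ (A @ x)
y_merged = W_merged @ x
assert np.allclose(y_merged, y_unmerged)
```

This identity is why a merged export is a drop-in HuggingFace model with baseline inference cost.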
## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos.

2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.

3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.

4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
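The InfoNCE objective above can be sketched compactly: each anchor is scored against every positive in the batch, and cross-entropy pushes probability mass onto its own positive. A numpy stand-in (the real `InfoNCELoss` in `training/losses.py` is a torch module; this only illustrates the math):

```python
# Numpy sketch of InfoNCE: row i's positive is column i; every other
# sample in the batch acts as a negative.
import numpy as np


def info_nce(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.07) -> float:
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity logits
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # cross-entropy on diagonal
```

Lowering `temperature` sharpens the softmax, penalizing hard negatives more aggressively, which is why raising it (e.g. `--temperature 0.1`) is suggested in Troubleshooting when training stalls.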
### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128

This ensures positive pairs (same logo) exist within each batch for contrastive learning.
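The K × M sampling above can be sketched as follows (names are illustrative; the real logic lives in `LogoContrastiveDataset` in `training/dataset.py`):

```python
# Sketch of K-brands x M-samples batch construction for contrastive training.
import random


def make_batch(samples_by_brand: dict, k: int = 32, m: int = 4, rng=None):
    """Pick k brands, then m samples from each, so every brand in the
    batch contributes positive pairs; all cross-brand pairs are negatives."""
    rng = rng or random.Random()
    eligible = [b for b, s in samples_by_brand.items() if len(s) >= m]
    brands = rng.sample(eligible, k)
    batch = []
    for brand in brands:
        batch.extend((brand, s) for s in rng.sample(samples_by_brand[brand], m))
    return batch
```

With the defaults this yields 128 samples per batch, matching the `logos_per_batch: 32` and `samples_per_logo: 4` settings in the config.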
### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
## Troubleshooting

### Out of Memory

Reduce batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 8 \
    --gradient-accumulation-steps 16
```

### Slow Training

Ensure mixed precision is enabled (`mixed_precision: true` is the default in `configs/jetson_orin.yaml`):

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit config to use `loss_type: "combined"`

### Import Error for Fine-Tuned Model

Ensure the `training/` module is in your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```
## Dependencies Added

The following were added to `pyproject.toml`:

```
peft>=0.7.0          # LoRA support
pyyaml>=6.0          # Config file parsing
torchvision>=0.20.0  # Image transforms
```