# CLIP Fine-Tuning for Logo Recognition
This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.
## Overview
The fine-tuning approach uses contrastive learning with LoRA (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing DetectLogosDETR class.
**Goal:** Improve F1 from ~60% to >72% on logo matching tasks.
## Files Created

### Training Module (`training/`)
| File | Description |
|---|---|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |
### Scripts
| File | Description |
|---|---|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |
### Configuration
| File | Description |
|---|---|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
| `configs/cloud_rtx4090.yaml` | Training config optimized for 24 GB cloud GPUs |
| `configs/cloud_a100.yaml` | Training config optimized for 80 GB cloud GPUs |
## Prerequisites

1. Install dependencies:

   ```bash
   uv sync
   ```

2. Prepare test data (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

   This creates:

   - `reference_logos/` - Cropped logo images organized by category/brand
   - `test_images/` - Full images for testing
   - `test_data_mapping.db` - SQLite database with mappings
## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```
### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --learning-rate 5e-6 \
  --max-epochs 30 \
  --batch-size 8
```
### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --resume checkpoints/epoch_10.pt
```
### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`
## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:
```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16           # LoRA rank (0 to disable)
lora_alpha: 32       # LoRA scaling factor
freeze_layers: 12    # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32              # Different logos per batch
samples_per_logo: 4              # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8   # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07    # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
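The YAML keys above map onto the `TrainingConfig` dataclass in `training/config.py`. The sketch below is illustrative, not the actual class: it assumes the dataclass fields mirror the YAML key names, and the `from_dict` helper is a hypothetical convenience for loading a parsed YAML mapping while ignoring unknown keys.

```python
# Sketch of a TrainingConfig-style dataclass (field names assumed to mirror
# the YAML keys above; the real class lives in training/config.py).
from dataclasses import dataclass, fields


@dataclass
class TrainingConfig:
    base_model: str = "openai/clip-vit-large-patch14"
    lora_r: int = 16
    lora_alpha: int = 32
    freeze_layers: int = 12
    batch_size: int = 16
    logos_per_batch: int = 32
    samples_per_logo: int = 4
    gradient_accumulation_steps: int = 8
    learning_rate: float = 1.0e-5
    max_epochs: int = 20
    mixed_precision: bool = True
    temperature: float = 0.07
    patience: int = 5
    min_delta: float = 0.001

    @classmethod
    def from_dict(cls, raw: dict) -> "TrainingConfig":
        # Keep only keys the dataclass declares, so extra YAML keys
        # do not raise a TypeError on construction.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


# Unspecified keys fall back to the defaults above.
cfg = TrainingConfig.from_dict({"batch_size": 8, "some_extra_key": 1})
```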
## Evaluation

### Test Fine-Tuned Model

```bash
uv run python test_logo_detection.py -n 50 \
  -e models/logo_detection/clip_finetuned \
  --matching-method multi-ref \
  --seed 42
```
### Compare with Baseline

```bash
# Baseline CLIP
uv run python test_logo_detection.py -n 50 \
  -e openai/clip-vit-large-patch14 \
  --matching-method multi-ref \
  --seed 42

# Fine-tuned model
uv run python test_logo_detection.py -n 50 \
  -e models/logo_detection/clip_finetuned \
  --matching-method multi-ref \
  --seed 42
```
### Expected Metrics
| Metric | Baseline CLIP | Target (Fine-tuned) |
|---|---|---|
| Precision | ~49% | >70% |
| Recall | ~77% | >75% |
| F1 Score | ~60% | >72% |
Training metrics to monitor:
- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
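The three monitoring metrics above can be computed directly from a batch of embeddings and their brand labels. This is a minimal NumPy sketch, not the actual `EmbeddingEvaluator` from `training/evaluation.py`; the function name and return keys are illustrative.

```python
# Sketch of the three validation metrics, computed from L2-normalized
# embeddings and integer brand labels (illustrative; the real logic lives
# in training/evaluation.py).
import numpy as np


def embedding_metrics(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T                            # cosine similarity matrix
    same = labels[:, None] == labels[None, :]     # same-brand mask
    off_diag = ~np.eye(len(labels), dtype=bool)   # ignore self-similarity
    pos = sims[same & off_diag].mean()            # mean positive similarity
    neg = sims[~same].mean()                      # mean negative similarity
    return {"pos": pos, "neg": neg, "separation": pos - neg}
```

A well-separated embedding space drives `pos` toward the >0.85 target and `neg` below 0.50, so `separation` exceeds 0.35.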
## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
  --checkpoint checkpoints/best.pt \
  --output models/logo_detection/clip_finetuned
```
With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
  --checkpoint checkpoints/best.pt \
  --output models/logo_detection/clip_finetuned \
  --merge-lora
```
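Merging removes the adapter's extra matmuls at inference time by folding the low-rank update into the base weight. The NumPy sketch below shows the arithmetic only; the actual export path uses the `peft` library rather than this hypothetical `merge_lora` helper.

```python
# What --merge-lora does arithmetically: precompute W' = W + (alpha/r) * B @ A
# once at export time, so inference uses a single weight matrix.
# Illustrative NumPy sketch; the real export uses peft.
import numpy as np


def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               r: int = 16, alpha: int = 32) -> np.ndarray:
    # LoRA computes W @ x + (alpha / r) * B @ (A @ x) during training;
    # merging bakes the second term into the weight.
    return W + (alpha / r) * (B @ A)


d, r = 8, 16
W = np.random.randn(d, d)
A = np.random.randn(r, d)   # r x d down-projection
B = np.random.randn(d, r)   # d x r up-projection
W_merged = merge_lora(W, A, B)

x = np.random.randn(d)
# Merged weight matches base + adapter applied separately (scale = 32/16 = 2).
assert np.allclose(W_merged @ x, W @ x + 2.0 * (B @ (A @ x)))
```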
## Using the Fine-Tuned Model with `DetectLogosDETR`

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```
## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos.

2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.

3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.

4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
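To make the InfoNCE objective concrete, here is a minimal NumPy sketch of the loss over one batch of embeddings: for each sample, same-brand embeddings in the batch are positives and everything else is a negative. This is illustrative only; the real `InfoNCELoss` in `training/losses.py` operates on torch tensors.

```python
# Minimal InfoNCE sketch over a batch of embeddings (illustrative NumPy
# version of the torch InfoNCELoss in training/losses.py).
import numpy as np


def info_nce(emb: np.ndarray, labels: np.ndarray,
             temperature: float = 0.07) -> float:
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    logits = (emb @ emb.T) / temperature
    np.fill_diagonal(logits, -np.inf)             # exclude self-pairs
    # Row-wise log-softmax over all other samples in the batch.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]      # same-brand positives
    np.fill_diagonal(pos, False)
    # Negative mean log-likelihood of positive pairs -> loss to minimize.
    return -log_prob[pos].mean()
```

When same-brand embeddings cluster tightly and different brands spread apart, the loss approaches zero; mismatched clusters drive it up.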
### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128
This ensures positive pairs (same logo) exist within each batch for contrastive learning.
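The K × M sampling scheme can be sketched in a few lines of plain Python. The function name and dict-based input are illustrative; the actual sampler lives in `training/dataset.py`.

```python
# Sketch of K-brands x M-samples batch construction (illustrative; the
# real sampler lives in training/dataset.py).
import random


def make_batch(images_by_brand: dict, logos_per_batch: int = 32,
               samples_per_logo: int = 4, rng=None) -> list:
    rng = rng or random.Random()
    # Pick K distinct brands, then M images from each, so every sample has
    # M - 1 in-batch positives and (K - 1) * M in-batch negatives.
    brands = rng.sample(sorted(images_by_brand), logos_per_batch)
    batch = []
    for brand in brands:
        batch.extend(rng.sample(images_by_brand[brand], samples_per_logo))
    return batch
```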
### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
## Troubleshooting

### Out of Memory

Reduce the batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
  --batch-size 8 \
  --gradient-accumulation-steps 16
```
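Both settings keep the effective batch at 8 × 16 = 128: gradients from several micro-batches are accumulated before a single optimizer step. A framework-free sketch of the accumulation pattern (the real loop in `training/trainer.py` uses torch; names here are illustrative):

```python
# Schematic gradient-accumulation loop: N micro-batches per optimizer step,
# so memory scales with batch_size while the effective batch stays constant.
def run_epoch(samples, batch_size: int = 8, accum_steps: int = 16) -> int:
    grad, steps = 0.0, 0
    for i in range(0, len(samples), batch_size):
        micro = samples[i:i + batch_size]
        grad += sum(micro) / len(micro)        # stand-in for loss.backward()
        if (i // batch_size + 1) % accum_steps == 0:
            steps += 1                          # optimizer.step()
            grad = 0.0                          # optimizer.zero_grad()
    return steps                                # optimizer steps per epoch
```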
### Slow Training

Ensure mixed precision is enabled:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is the default in jetson_orin.yaml
```
### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit the config to use `loss_type: "combined"`
### Import Error for Fine-Tuned Model

Ensure the `training/` module is in your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```
## Dependencies Added

The following were added to `pyproject.toml`:

```
peft>=0.7.0          # LoRA support
pyyaml>=6.0          # Config file parsing
torchvision>=0.20.0  # Image transforms
```