logo_test/CLIP_FINETUNING.md

Rick McEwen, commit 44e8b6ae7d: Add CLIP fine-tuning pipeline for logo recognition (2026-01-04 13:45:25 -05:00)

# CLIP Fine-Tuning for Logo Recognition

This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.

## Overview

The fine-tuning approach uses contrastive learning with LoRA (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.

**Goal:** Improve F1 from ~60% to >72% on logo matching tasks.

## Files Created

### Training Module (`training/`)

| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |
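As a rough sketch of the shape of `config.py` (field names and defaults are taken from the `jetson_orin.yaml` values shown under "Configuration Options"; the actual dataclass may differ):

```python
from dataclasses import dataclass

# Hypothetical sketch of the TrainingConfig dataclass; defaults mirror
# the jetson_orin.yaml example in this document.
@dataclass
class TrainingConfig:
    base_model: str = "openai/clip-vit-large-patch14"
    lora_r: int = 16                        # LoRA rank (0 to disable)
    lora_alpha: int = 32                    # LoRA scaling factor
    freeze_layers: int = 12                 # freeze first N transformer layers
    batch_size: int = 16
    logos_per_batch: int = 32               # different logos per batch
    samples_per_logo: int = 4               # samples per logo (positive pairs)
    gradient_accumulation_steps: int = 8    # effective batch = 128
    learning_rate: float = 1.0e-5
    max_epochs: int = 20
    mixed_precision: bool = True
    temperature: float = 0.07               # InfoNCE temperature
    patience: int = 5                       # early stopping
    min_delta: float = 0.001
```

A dataclass keeps all hyperparameters in one typed, defaulted place, so YAML configs and CLI overrides only need to fill in deviations from these defaults.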

### Scripts

| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |

### Configuration

| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |
| `configs/cloud_rtx4090.yaml` | Training config optimized for 24GB cloud GPUs |
| `configs/cloud_a100.yaml` | Training config optimized for 80GB cloud GPUs |

## Prerequisites

1. Install dependencies:

   ```shell
   uv sync
   ```

2. Prepare test data (if not already done):

   ```shell
   uv run python prepare_test_data.py
   ```

   This creates:

   - `reference_logos/` - Cropped logo images organized by category/brand
   - `test_images/` - Full images for testing
   - `test_data_mapping.db` - SQLite database with mappings

## Training

### Basic Training

```shell
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```shell
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --learning-rate 5e-6 \
    --max-epochs 30 \
    --batch-size 8
```

### Resume from Checkpoint

```shell
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --resume checkpoints/epoch_10.pt
```

### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`

## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                    # LoRA rank (0 to disable)
lora_alpha: 32                # LoRA scaling factor
freeze_layers: 12             # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32           # Different logos per batch
samples_per_logo: 4           # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8  # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07             # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
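To make the `temperature` parameter concrete, here is a minimal, dependency-free sketch of the InfoNCE loss for a single anchor (the actual `losses.py` presumably operates on batched tensors, so treat this as illustrative only):

```python
import math

def info_nce_loss(all_sims: list[float], pos_index: int,
                  temperature: float = 0.07) -> float:
    """InfoNCE for one anchor: -log(exp(s_pos/t) / sum_i exp(s_i/t)).

    all_sims holds cosine similarities between the anchor and every
    candidate in the batch; pos_index marks the positive pair.
    A lower temperature sharpens the softmax, penalizing hard negatives more.
    """
    logits = [s / temperature for s in all_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[pos_index] - log_denom)
```

With a perfect positive (similarity 1.0) and one negative at 0.0, the loss at `temperature=1.0` is `log(1 + e^-1) ≈ 0.313`; at `temperature=0.07` the same pair yields a near-zero loss because the softmax is far sharper.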

## Evaluation

### Test Fine-Tuned Model

```shell
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    --matching-method multi-ref \
    --seed 42
```

### Compare with Baseline

```shell
# Baseline CLIP
uv run python test_logo_detection.py -n 50 \
    -e openai/clip-vit-large-patch14 \
    --matching-method multi-ref \
    --seed 42

# Fine-tuned model
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    --matching-method multi-ref \
    --seed 42
```

### Expected Metrics

| Metric | Baseline CLIP | Target (Fine-tuned) |
|--------|---------------|---------------------|
| Precision | ~49% | >70% |
| Recall | ~77% | >75% |
| F1 Score | ~60% | >72% |
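The baseline row is self-consistent: plugging the baseline precision and recall into the F1 formula recovers roughly 60%. A quick check:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Baseline: P ~= 0.49, R ~= 0.77  ->  F1 ~= 0.60
baseline_f1 = f1_score(0.49, 0.77)
```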

Training metrics to monitor:

- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
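These three numbers can be computed from cosine similarities of validation pairs; here is a dependency-free sketch (the actual `EmbeddingEvaluator` implementation may differ):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def separation_metrics(pos_sims: list[float], neg_sims: list[float]):
    """Mean positive/negative similarity and their gap (the 'separation')."""
    mean_pos = sum(pos_sims) / len(pos_sims)
    mean_neg = sum(neg_sims) / len(neg_sims)
    return mean_pos, mean_neg, mean_pos - mean_neg
```

For example, positive similarities `[0.9, 0.8]` against negative similarities `[0.4, 0.6]` give approximately `(0.85, 0.5, 0.35)` — exactly on the target boundaries above.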

## Export Model

To export a checkpoint to HuggingFace format:

```shell
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```shell
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned \
    --merge-lora
```

## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```

## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos.
2. **LoRA (Low-Rank Adaptation)**: Adds small trainable low-rank matrices to the attention layers instead of fine-tuning all weights. This is memory-efficient and mitigates catastrophic forgetting.
3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.
4. **Logo-Level Splits**: Splits data by logo brand (not by image) so validation measures generalization to unseen logos.
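The logo-level split (point 4) can be sketched as follows; `split_brands` is a hypothetical helper, not necessarily the API in `dataset.py`:

```python
import random

def split_brands(brands: list[str], val_frac: float = 0.1,
                 seed: int = 42) -> tuple[list[str], list[str]]:
    """Deterministically split brand names so that train and validation
    share no logos; every image of a brand follows its brand."""
    shuffled = sorted(brands)            # sort first for reproducibility
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]
```

Because the split is by brand, a validation image can never have a near-duplicate of itself in training, which is what makes the metrics a test of generalization to unseen logos.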

### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128

This ensures positive pairs (same logo) exist within each batch for contrastive learning.
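A minimal sketch of this K×M sampling strategy (names are illustrative; the real `LogoContrastiveDataset` batching may differ):

```python
import random

def sample_contrastive_batch(images_by_brand: dict[str, list[str]],
                             logos_per_batch: int = 32,
                             samples_per_logo: int = 4,
                             rng=None):
    """Pick K brands, then M images per brand, so every brand in the
    batch contributes positive pairs for the contrastive loss."""
    rng = rng or random.Random()
    brands = rng.sample(sorted(images_by_brand), logos_per_batch)
    batch = []
    for brand in brands:
        imgs = images_by_brand[brand]
        # sample with replacement if a brand has fewer than M images
        picks = (rng.sample(imgs, samples_per_logo)
                 if len(imgs) >= samples_per_logo
                 else rng.choices(imgs, k=samples_per_logo))
        batch.extend((brand, img) for img in picks)
    return batch
```

Each brand in the batch yields M×(M−1) ordered positive pairs, while the other K−1 brands supply in-batch negatives — no separate negative mining pass is needed.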

### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)

## Troubleshooting

### Out of Memory

Reduce the batch size and increase gradient accumulation:

```shell
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 8 \
    --gradient-accumulation-steps 16
```
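This trade-off keeps the effective (optimizer-level) batch size constant while halving peak activation memory:

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Gradients are accumulated over several small forward/backward passes
    before each optimizer step, so the effective batch is their product."""
    return batch_size * grad_accum_steps

# default config:        16 * 8  = 128
# reduced-memory config:  8 * 16 = 128
```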

### Slow Training

Ensure mixed precision is enabled:

```shell
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is the default in jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit the config to use `loss_type: "combined"`

### Import Error for Fine-Tuned Model

Ensure the `training/` module is on your Python path:

```shell
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```

## Dependencies Added

The following were added to `pyproject.toml`:

```text
peft>=0.7.0          # LoRA support
pyyaml>=6.0          # Config file parsing
torchvision>=0.20.0  # Image transforms
```