# CLIP Fine-Tuning for Logo Recognition

This document describes the CLIP fine-tuning pipeline for improving logo embedding similarity using the LogoDet-3K dataset.

## Overview

The fine-tuning approach uses contrastive learning with LoRA (Low-Rank Adaptation) to train CLIP's vision encoder for better logo similarity matching while maintaining compatibility with the existing `DetectLogosDETR` class.

**Goal:** Improve F1 from ~60% to >72% on logo matching tasks.

## Files Created

### Training Module (`training/`)

| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `config.py` | `TrainingConfig` dataclass with all hyperparameters |
| `dataset.py` | `LogoContrastiveDataset` with logo-level splits and augmentations |
| `model.py` | `LogoFineTunedCLIP` wrapper with LoRA support |
| `losses.py` | `InfoNCELoss`, `TripletLoss`, `SupConLoss`, `CombinedLoss` |
| `trainer.py` | Training loop with mixed precision, checkpointing, early stopping |
| `evaluation.py` | `EmbeddingEvaluator` for validation metrics |

### Scripts

| File | Description |
|------|-------------|
| `train_clip_logo.py` | Main training entry point |
| `export_model.py` | Export trained models to HuggingFace-compatible format |

### Configuration

| File | Description |
|------|-------------|
| `configs/jetson_orin.yaml` | Training config optimized for Jetson Orin AGX |

## Prerequisites

1. Install dependencies:

   ```bash
   uv sync
   ```

2. Prepare test data (if not already done):

   ```bash
   uv run python prepare_test_data.py
   ```

   This creates:

   - `reference_logos/` - Cropped logo images organized by category/brand
   - `test_images/` - Full images for testing
   - `test_data_mapping.db` - SQLite database with mappings
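To sanity-check the mapping database without assuming its schema (which `prepare_test_data.py` defines), a minimal sketch using Python's built-in `sqlite3`:

```python
import sqlite3

# Open the mapping database produced by prepare_test_data.py.
conn = sqlite3.connect("test_data_mapping.db")

# List whatever tables the script created, without assuming a schema.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print([name for (name,) in tables])

conn.close()
```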

## Training

### Basic Training

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
```

### Training with Overrides

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --learning-rate 5e-6 \
    --max-epochs 30 \
    --batch-size 8
```

### Resume from Checkpoint

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --resume checkpoints/epoch_10.pt
```

### Training Output

- Checkpoints saved to `checkpoints/`
- Best model saved as `checkpoints/best.pt`
- Final model exported to `models/logo_detection/clip_finetuned/`

## Configuration Options

Key parameters in `configs/jetson_orin.yaml`:

```yaml
# Model
base_model: "openai/clip-vit-large-patch14"
lora_r: 16                    # LoRA rank (0 to disable)
lora_alpha: 32                # LoRA scaling factor
freeze_layers: 12             # Freeze first N transformer layers

# Batch construction
batch_size: 16
logos_per_batch: 32           # Different logos per batch
samples_per_logo: 4           # Samples per logo (creates positive pairs)
gradient_accumulation_steps: 8  # Effective batch = 128

# Training
learning_rate: 1.0e-5
max_epochs: 20
mixed_precision: true
temperature: 0.07             # InfoNCE temperature

# Early stopping
patience: 5
min_delta: 0.001
```
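For orientation, these keys feed the `TrainingConfig` dataclass. A minimal loading sketch with pyyaml, assuming the dataclass fields mirror the YAML keys one-to-one (the real `training/config.py` may add validation):

```python
import yaml

from training.config import TrainingConfig

# Parse the YAML and hand its keys to the dataclass.
# Assumes field names mirror the YAML keys one-to-one.
with open("configs/jetson_orin.yaml") as f:
    config = TrainingConfig(**yaml.safe_load(f))

# Effective batch size: per-step batch times accumulation steps.
print(config.batch_size * config.gradient_accumulation_steps)  # 16 * 8 = 128
```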

## Evaluation

### Test Fine-Tuned Model

**Important:** The fine-tuned model requires a higher threshold (0.82) than baseline (0.75).

```bash
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    -t 0.82 \
    --matching-method multi-ref \
    --seed 42
```

### Compare with Baseline

```bash
# Baseline CLIP (threshold 0.75)
uv run python test_logo_detection.py -n 50 \
    -e openai/clip-vit-large-patch14 \
    -t 0.75 \
    --matching-method multi-ref \
    --seed 42

# Fine-tuned model (threshold 0.82)
uv run python test_logo_detection.py -n 50 \
    -e models/logo_detection/clip_finetuned \
    -t 0.82 \
    --matching-method multi-ref \
    --seed 42
```

## Threshold Selection

The fine-tuned model requires a higher similarity threshold than baseline CLIP. This is because contrastive learning successfully pushed non-matching logo similarities much lower, changing the score distribution.

### Similarity Distribution Analysis

| Metric | Baseline | Fine-tuned |
|--------|----------|------------|
| Wrong logos mean similarity | 0.66 | 0.44 |
| Wrong logos above 0.75 | 23.2% | 0.6% |
| Correct logos mean similarity | 0.75 | 0.64 |
| Optimal threshold | 0.756 | 0.819 |
| F1 at optimal threshold | 67.1% | 71.9% |

**Key insight:** The fine-tuned model dramatically reduced similarities to wrong logos (mean down from 0.66 to 0.44). At threshold 0.75 it already rejects far more non-matches than baseline, but the few wrong-logo scores that remain bunch up just above 0.75, so the threshold must rise to ~0.82 to filter out those false positives.

### Analyze Similarity Distribution

To find the optimal threshold for your model:

```bash
# Run detailed similarity analysis
./analyze_similarity_distribution.sh --model finetuned

# Or analyze both models
./analyze_similarity_distribution.sh --model both
```

This outputs distribution statistics and suggests an optimal threshold based on the data.
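Under the hood this kind of analysis reduces to a threshold sweep over the two score populations. A minimal sketch of the idea, assuming `pos_sims` and `neg_sims` are NumPy arrays of cosine similarities for correct and wrong logo pairs (names are illustrative, not the script's API):

```python
import numpy as np

def best_threshold(pos_sims: np.ndarray, neg_sims: np.ndarray) -> tuple[float, float]:
    """Sweep candidate thresholds and return the (threshold, F1) maximizing F1."""
    best_t, best_f1 = 0.0, 0.0
    for t in np.linspace(0.5, 0.95, 200):
        tp = int((pos_sims >= t).sum())  # correct matches accepted
        fn = int((pos_sims < t).sum())   # correct matches rejected
        fp = int((neg_sims >= t).sum())  # wrong matches accepted
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```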

## Expected Metrics

| Metric | Baseline (t=0.75) | Fine-tuned (t=0.82) |
|--------|-------------------|---------------------|
| Precision | ~49% | >65% |
| Recall | ~77% | >70% |
| F1 Score | ~60% | >70% |

Training metrics to monitor:

- Mean positive similarity: target > 0.85
- Mean negative similarity: target < 0.50
- Embedding separation: target > 0.35
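All three can be read off the pairwise cosine-similarity matrix of a validation batch. A minimal sketch, assuming `embeddings` is an `(N, D)` tensor and `labels` holds each sample's brand id (the project's `EmbeddingEvaluator` may compute these differently):

```python
import torch
import torch.nn.functional as F

def separation_metrics(embeddings: torch.Tensor, labels: torch.Tensor):
    """Mean positive similarity, mean negative similarity, and their gap."""
    emb = F.normalize(embeddings, dim=-1)
    sims = emb @ emb.T                                   # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-brand mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = sims[same & ~eye].mean()                       # same logo, different sample
    neg = sims[~same].mean()                             # different logos
    return pos.item(), neg.item(), (pos - neg).item()    # separation = pos - neg
```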

## Export Model

To export a checkpoint to HuggingFace format:

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned
```

With LoRA weight merging (reduces inference overhead):

```bash
uv run python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned \
    --merge-lora
```
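Merging folds each low-rank update into the corresponding base weight matrix, so inference needs no adapter indirection. With peft this is `merge_and_unload()`; a sketch under the assumption that the checkpoint stores a standard peft adapter (the adapter path here is hypothetical):

```python
from peft import PeftModel
from transformers import CLIPVisionModel

# Load the frozen base encoder, then the trained LoRA adapter on top of it.
base = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
model = PeftModel.from_pretrained(base, "checkpoints/lora_adapter")  # hypothetical adapter path

# Fold the low-rank updates into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("models/logo_detection/clip_finetuned")
```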

## Using Fine-Tuned Model with DetectLogosDETR

The fine-tuned model works as a drop-in replacement:

```python
from logo_detection_detr import DetectLogosDETR

# Use fine-tuned model
detector = DetectLogosDETR(
    logger=logger,
    embedding_model="models/logo_detection/clip_finetuned",
)

# Or use baseline for comparison
detector_baseline = DetectLogosDETR(
    logger=logger,
    embedding_model="openai/clip-vit-large-patch14",
)
```

## Architecture Details

### Training Approach

1. **Contrastive Learning**: Uses InfoNCE loss to maximize similarity between embeddings of the same logo while minimizing similarity to different logos (see the sketch after this list).

2. **LoRA (Low-Rank Adaptation)**: Adds small trainable matrices to attention layers instead of fine-tuning all weights. This is memory-efficient and prevents catastrophic forgetting.

3. **Layer Freezing**: Freezes the first 12 of 24 transformer layers to preserve CLIP's low-level visual features while adapting high-level semantics.

4. **Logo-Level Splits**: Splits data by logo brand (not by image) to test generalization to unseen logos.
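To make the contrastive objective concrete, here is a minimal supervised InfoNCE sketch over one batch, assuming `labels` gives each sample's logo id (the project's `losses.py` may differ in detail):

```python
import torch
import torch.nn.functional as F

def info_nce(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07):
    """Supervised InfoNCE: pull same-logo embeddings together, push others apart.

    embeddings: (N, D) image embeddings; labels: (N,) logo ids.
    """
    emb = F.normalize(embeddings, dim=-1)
    logits = emb @ emb.T / temperature                       # scaled cosine similarities
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    logits = logits.masked_fill(eye, float("-inf"))          # never contrast a sample with itself
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Zero out non-positive entries, then average over each sample's positives.
    per_sample = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1).clamp(min=1)
    return per_sample.mean()
```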

### Batch Construction

Each batch contains:

- K different logo brands (default: 32)
- M samples per brand (default: 4)
- Total samples: K × M = 128

This ensures positive pairs (same logo) exist within each batch for contrastive learning.
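A minimal sketch of this K×M sampling scheme, assuming a flat list of `(index, brand)` pairs (`LogoContrastiveDataset` handles the real version internally):

```python
import random
from collections import defaultdict

def kxm_batches(samples, logos_per_batch=32, samples_per_logo=4):
    """Yield index batches with K brands and M samples per brand.

    samples: iterable of (dataset_index, brand) pairs.
    """
    by_brand = defaultdict(list)
    for idx, brand in samples:
        by_brand[brand].append(idx)

    # Keep only brands with enough samples to form M positives.
    brands = [b for b, idxs in by_brand.items() if len(idxs) >= samples_per_logo]
    random.shuffle(brands)

    for i in range(0, len(brands) - logos_per_batch + 1, logos_per_batch):
        batch = []
        for brand in brands[i : i + logos_per_batch]:
            batch.extend(random.sample(by_brand[brand], samples_per_logo))
        yield batch  # len(batch) == logos_per_batch * samples_per_logo
```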

### Data Augmentation

Medium-strength augmentations:

- Random horizontal flip
- Random rotation (±15°)
- Color jitter (brightness, contrast, saturation)
- Random affine transforms
- Random grayscale (10% of images)
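In torchvision terms the pipeline looks roughly like this; the flip, rotation, and grayscale values come from the list above, while the jitter and affine magnitudes are illustrative (the real values live in `dataset.py`):

```python
from torchvision import transforms

# Medium-strength training augmentations matching the list above.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),  # +/- 15 degrees
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # illustrative strengths
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),  # illustrative ranges
    transforms.RandomGrayscale(p=0.1),      # 10% of images
])
```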

## Troubleshooting

### Out of Memory

Reduce batch size and increase gradient accumulation:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 8 \
    --gradient-accumulation-steps 16
```

### Slow Training

Ensure mixed precision is enabled:

```bash
uv run python train_clip_logo.py --config configs/jetson_orin.yaml
# mixed_precision: true is default in jetson_orin.yaml
```

### No Improvement

Try adjusting:

- Lower learning rate: `--learning-rate 5e-6`
- Higher temperature: `--temperature 0.1`
- Different loss: edit the config to use `loss_type: "combined"`

### Import Error for Fine-Tuned Model

Ensure the `training/` module is in your Python path:

```bash
export PYTHONPATH="${PYTHONPATH}:/data/dev.python/logo_test"
```

## Dependencies Added

The following were added to `pyproject.toml`:

```
peft>=0.7.0          # LoRA support
pyyaml>=6.0          # Config file parsing
torchvision>=0.20.0  # Image transforms
```