From f74d4b6981a8f5fe9c577ee1bded9fb7fab6357e Mon Sep 17 00:00:00 2001
From: Rick McEwen <mcewen@flgator.com>
Date: Mon, 5 Jan 2026 14:09:38 -0500
Subject: [PATCH] Document threshold tuning for fine-tuned CLIP model

- Add threshold selection section with similarity distribution analysis
- Document that fine-tuned model needs threshold 0.82 (vs baseline 0.75)
- Add table comparing baseline vs fine-tuned distributions
- Update test commands to include correct thresholds
- Reference analyze_similarity_distribution.sh for threshold optimization
---
 CLIP_FINETUNING.md | 49 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/CLIP_FINETUNING.md b/CLIP_FINETUNING.md
index 352b72f..92d4a18 100644
--- a/CLIP_FINETUNING.md
+++ b/CLIP_FINETUNING.md
@@ -114,9 +114,12 @@ min_delta: 0.001
 
 ### Test Fine-Tuned Model
 
+**Important**: The fine-tuned model requires a higher threshold (0.82) than baseline (0.75).
+
 ```bash
 uv run python test_logo_detection.py -n 50 \
     -e models/logo_detection/clip_finetuned \
+    -t 0.82 \
     --matching-method multi-ref \
     --seed 42
 ```
@@ -124,26 +127,58 @@ uv run python test_logo_detection.py -n 50 \
 ### Compare with Baseline
 
 ```bash
-# Baseline CLIP
+# Baseline CLIP (threshold 0.75)
 uv run python test_logo_detection.py -n 50 \
     -e openai/clip-vit-large-patch14 \
+    -t 0.75 \
     --matching-method multi-ref \
     --seed 42
 
-# Fine-tuned model
+# Fine-tuned model (threshold 0.82)
 uv run python test_logo_detection.py -n 50 \
     -e models/logo_detection/clip_finetuned \
+    -t 0.82 \
     --matching-method multi-ref \
     --seed 42
 ```
 
+### Threshold Selection
+
+The fine-tuned model requires a **higher similarity threshold** than baseline CLIP. This is because contrastive learning successfully pushed non-matching logo similarities much lower, changing the score distribution.
+
+#### Similarity Distribution Analysis
+
+| Metric | Baseline | Fine-tuned |
+|--------|----------|------------|
+| Wrong logos mean similarity | 0.66 | **0.44** |
+| Wrong logos above 0.75 | 23.2% | **0.6%** |
+| Correct logos mean similarity | 0.75 | 0.64 |
+| Optimal threshold | 0.756 | **0.819** |
+| F1 at optimal threshold | 67.1% | **71.9%** |
+
+**Key insight**: The fine-tuned model dramatically reduced similarities to wrong logos (from 0.66 to 0.44 mean). This means at threshold 0.75, it correctly rejects far more non-matches, but needs a higher threshold to avoid false positives from scores that bunch up just above 0.75.
+
+#### Analyze Similarity Distribution
+
+To find the optimal threshold for your model:
+
+```bash
+# Run detailed similarity analysis
+./analyze_similarity_distribution.sh --model finetuned
+
+# Or analyze both models
+./analyze_similarity_distribution.sh --model both
+```
+
+This outputs distribution statistics and suggests an optimal threshold based on the data.
+
 ### Expected Metrics
 
-| Metric | Baseline CLIP | Target (Fine-tuned) |
-|--------|---------------|---------------------|
-| Precision | ~49% | >70% |
-| Recall | ~77% | >75% |
-| F1 Score | ~60% | >72% |
+| Metric | Baseline (t=0.75) | Fine-tuned (t=0.82) |
+|--------|-------------------|---------------------|
+| Precision | ~49% | >65% |
+| Recall | ~77% | >70% |
+| F1 Score | ~60% | >70% |
 
 Training metrics to monitor:
 - Mean positive similarity: target > 0.85