diff --git a/CLIP_FINETUNING.md b/CLIP_FINETUNING.md
index 352b72f..92d4a18 100644
--- a/CLIP_FINETUNING.md
+++ b/CLIP_FINETUNING.md
@@ -114,9 +114,12 @@ min_delta: 0.001
 ### Test Fine-Tuned Model
 
+**Important**: The fine-tuned model requires a higher threshold (0.82) than baseline (0.75).
+
 ```bash
 uv run python test_logo_detection.py -n 50 \
   -e models/logo_detection/clip_finetuned \
+  -t 0.82 \
   --matching-method multi-ref \
   --seed 42
 ```
 
@@ -124,26 +127,58 @@ uv run python test_logo_detection.py -n 50 \
 ### Compare with Baseline
 
 ```bash
-# Baseline CLIP
+# Baseline CLIP (threshold 0.75)
 uv run python test_logo_detection.py -n 50 \
   -e openai/clip-vit-large-patch14 \
+  -t 0.75 \
   --matching-method multi-ref \
   --seed 42
 
-# Fine-tuned model
+# Fine-tuned model (threshold 0.82)
 uv run python test_logo_detection.py -n 50 \
   -e models/logo_detection/clip_finetuned \
+  -t 0.82 \
   --matching-method multi-ref \
   --seed 42
 ```
 
+### Threshold Selection
+
+The fine-tuned model requires a **higher similarity threshold** than baseline CLIP. This is because contrastive learning successfully pushed non-matching logo similarities much lower, changing the score distribution.
+
+#### Similarity Distribution Analysis
+
+| Metric | Baseline | Fine-tuned |
+|--------|----------|------------|
+| Wrong logos mean similarity | 0.66 | **0.44** |
+| Wrong logos above 0.75 | 23.2% | **0.6%** |
+| Correct logos mean similarity | 0.75 | 0.64 |
+| Optimal threshold | 0.756 | **0.819** |
+| F1 at optimal threshold | 67.1% | **71.9%** |
+
+**Key insight**: The fine-tuned model dramatically reduced mean similarity to wrong logos (from 0.66 to 0.44), so at threshold 0.75 it already rejects far more non-matches; raising the threshold to ~0.82 additionally filters out the few wrong-logo scores that still land just above 0.75.
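The "Optimal threshold" and "F1 at optimal threshold" rows above come from sweeping candidate thresholds over the similarity scores and keeping the one that maximizes F1. A minimal, self-contained sketch of such a sweep (the scores below are toy values, not real model output, and this is not the project's actual analysis code):

```python
# Hedged sketch: pick the similarity threshold that maximizes F1,
# given scores for correct (positive) and wrong (negative) logo pairs.

def f1_at_threshold(pos_scores, neg_scores, threshold):
    """Treat scores >= threshold as predicted matches; return F1."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    fn = len(pos_scores) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def optimal_threshold(pos_scores, neg_scores):
    # Sweep thresholds from 0.000 to 1.000 in 0.001 steps.
    candidates = [i / 1000 for i in range(1001)]
    return max(candidates, key=lambda t: f1_at_threshold(pos_scores, neg_scores, t))

# Toy scores mimicking the fine-tuned distribution: positives near 0.64,
# negatives near 0.44 (illustrative only).
pos = [0.58, 0.61, 0.63, 0.66, 0.70, 0.74]
neg = [0.38, 0.41, 0.44, 0.47, 0.52, 0.55]
best = optimal_threshold(pos, neg)
print(f"best threshold {best:.3f}, F1 {f1_at_threshold(pos, neg, best):.3f}")
```

On real data the positive and negative distributions overlap, so the sweep trades precision against recall instead of finding a clean separating point as in this toy example.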
+
+#### Analyze Similarity Distribution
+
+To find the optimal threshold for your model:
+
+```bash
+# Run detailed similarity analysis
+./analyze_similarity_distribution.sh --model finetuned
+
+# Or analyze both models
+./analyze_similarity_distribution.sh --model both
+```
+
+This outputs distribution statistics and suggests an optimal threshold based on the data.
+
 ### Expected Metrics
 
-| Metric | Baseline CLIP | Target (Fine-tuned) |
-|--------|---------------|---------------------|
-| Precision | ~49% | >70% |
-| Recall | ~77% | >75% |
-| F1 Score | ~60% | >72% |
+| Metric | Baseline (t=0.75) | Fine-tuned (t=0.82) |
+|--------|-------------------|---------------------|
+| Precision | ~49% | >65% |
+| Recall | ~77% | >70% |
+| F1 Score | ~60% | >70% |
 
 Training metrics to monitor:
 - Mean positive similarity: target > 0.85
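As a rough illustration of how the `-t` threshold interacts with `--matching-method multi-ref`, the decision reduces to "best cosine similarity against any reference embedding vs. the per-model threshold". A hypothetical sketch with toy 3-d vectors standing in for CLIP image embeddings (not the project's actual matching code):

```python
# Hedged sketch of multi-reference matching: a query logo counts as a match
# if its best cosine similarity over all reference embeddings clears the
# threshold (0.75 for baseline CLIP, 0.82 for the fine-tuned model).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_match(query_emb, reference_embs, threshold):
    """Compare against every reference and keep the best score."""
    best = max(cosine(query_emb, ref) for ref in reference_embs)
    return best >= threshold, best

# Toy 3-d embeddings (illustrative only; real CLIP features are 768-d).
refs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
same_logo = [0.95, 0.05, 0.0]   # near the references
other_logo = [0.1, 0.9, 0.2]    # far from the references

print(is_match(same_logo, refs, 0.82))   # high best similarity -> match
print(is_match(other_logo, refs, 0.82))  # low best similarity -> no match
```

Because the score is a max over references, adding more reference images can only raise a logo's best similarity, which is another reason a tighter threshold helps keep precision up.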