Rick McEwen 2c41549ae0 Document margin behavior and update model comparison script

- Add section explaining how margin works differently in multi-ref vs
  margin-only matching, with examples showing why margin-only fails
  when using multiple references per logo
- Update run_model_comparison.sh to use optimal threshold (0.70) and
  margin (0.05) based on test results
- Add DINOv2 Large model test to comparison script
- Add threshold optimization test analysis to results document

2026-01-02 14:42:53 -05:00

Logo Detection Test Results Analysis

This document provides analysis of logo detection test results across different matching methods and configurations.

Test Run: CLIP Defaults with All Matching Methods

Date: 2025-12-31
Embedding Model: openai/clip-vit-large-patch14 (default)

Test Configuration

Parameter	Value
Reference logos	20
Refs per logo	10
Total reference embeddings	189
Positive samples per logo	20
Negative samples per logo	100
Test images processed	~2,350
Similarity threshold	0.70
DETR threshold	0.50
Margin	0.05
Min matching refs	3
Random seed	42

Results Summary

Method	TP	FP	FN	Precision	Recall	F1
Simple	751	58,221	9	1.3%	203.5%*	2.5%
Margin	60	26	310	69.8%	16.3%	26.4%
Multi-ref (mean)	233	217	170	51.8%	63.1%	56.9%
Multi-ref (max)	278	259	136	51.8%	75.3%	61.4%

*Recall is computed as TP divided by the 369 expected logo instances, so a value above 100% indicates multiple true positive detections per expected logo (multiple detected regions matching the same logo).
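The percentages in the table can be reproduced from the raw counts. The sketch below is a minimal check, not the project's evaluation code: the function name `detection_metrics` is hypothetical, and the recall denominator (369 expected logo instances rather than TP + FN) is inferred from the reported figures.

```python
def detection_metrics(tp: int, fp: int, expected: int) -> dict:
    """Return precision, recall, and F1 as percentages rounded to one decimal.

    Assumption: recall is TP / expected logo instances, not TP / (TP + FN),
    which is why it can exceed 100% for the simple method.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": round(100 * precision, 1),
        "recall": round(100 * recall, 1),
        "f1": round(100 * f1, 1),
    }


EXPECTED = 369  # expected logo instances, as quoted in the analysis text

# (TP, FP) per method, copied from the results summary table
results = {
    "Simple": (751, 58_221),
    "Margin": (60, 26),
    "Multi-ref (mean)": (233, 217),
    "Multi-ref (max)": (278, 259),
}

for method, (tp, fp) in results.items():
    print(method, detection_metrics(tp, fp, EXPECTED))
```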

Analysis by Method

Simple Matching

The simple method returns ALL logos above the similarity threshold without any rejection logic. This serves as a baseline to understand the raw discriminative power of CLIP embeddings.

Observations:

58,221 false positives vs 751 true positives (~78:1 ratio)
At threshold 0.70, CLIP embeddings are not discriminative enough to distinguish between different logos
The extremely high false positive count indicates that unrelated logo regions frequently produce similarity scores above 0.70
This method is unsuitable for production use but valuable for understanding the embedding space
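As a rough illustration of why this baseline over-matches, the sketch below returns every reference logo whose cosine similarity clears the threshold. It is not the project's implementation: `cosine_similarity`, `simple_match`, the toy 3-dimensional vectors, and the use of `>=` at the threshold are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def simple_match(query, references, threshold=0.70):
    """Return every (logo, score) at or above the threshold, best first.

    There is no rejection logic: several logos can match one region.
    """
    scored = [(logo, cosine_similarity(query, emb)) for logo, emb in references]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy embeddings (hypothetical); real CLIP embeddings have far more dimensions.
references = [
    ("logo_a", [1.0, 0.0, 0.0]),
    ("logo_b", [0.8, 0.6, 0.0]),
    ("logo_c", [0.0, 0.0, 1.0]),
]
query = [0.9, 0.2, 0.1]

matches = simple_match(query, references)
for logo, score in matches:
    print(f"{logo}: {score:.3f}")
```

In this toy case both logo_a and logo_b clear 0.70 for the same region, so one of them is necessarily a false positive.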

Margin-Based Matching

The margin method requires the best match to exceed the second-best by a minimum margin (0.05), rejecting ambiguous matches.

Observations:

Highest precision (69.8%) but very low recall (16.3%)
Only 60 true positives out of 369 expected
The margin requirement is too strict when using multiple references per logo
With 10 refs per logo, references from the SAME logo compete with each other
- Example: If Logo A has refs scoring 0.85 and 0.84, the margin is only 0.01, causing rejection
This explains why margin matching produces fewer matches than multi-ref methods
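The same-logo competition described above can be sketched as follows. This is a minimal illustration that assumes the margin is taken between the two highest-scoring individual references, whatever logo they belong to; `margin_match` and the sample scores are hypothetical.

```python
def margin_match(ref_scores, threshold=0.70, margin=0.05):
    """Accept the top-scoring reference only if it beats the runner-up by `margin`.

    ref_scores: list of (logo, similarity), one entry per reference image.
    Assumption: references compete individually, regardless of logo.
    Returns the matched logo name, or None if the match is rejected.
    """
    ranked = sorted(ref_scores, key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < threshold:
        return None  # nothing clears the similarity threshold
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None  # runner-up is too close: treated as ambiguous
    return ranked[0][0]


# Two references for logo_a score 0.85 and 0.84, so the margin is only 0.01.
multi_ref = [("logo_a", 0.85), ("logo_a", 0.84), ("logo_b", 0.62)]
# With a single reference per logo the runner-up is a different logo.
single_ref = [("logo_a", 0.85), ("logo_b", 0.62)]

print(margin_match(multi_ref))   # rejected
print(margin_match(single_ref))  # accepted as logo_a
```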

Multi-Ref Matching (Mean Similarity)

Uses the average similarity across all reference images for each logo.

Observations:

Balanced precision (51.8%) and recall (63.1%)
F1 score of 56.9%
False positive ratio approximately 1:1 with true positives (217 FP vs 233 TP)
Mean aggregation penalizes logos where some references don't match well
More conservative than max aggregation

Multi-Ref Matching (Max Similarity)

Uses the highest similarity score from any single reference image.

Observations:

Best F1 score (61.4%) and recall (75.3%)
Same precision as mean method (51.8%)
278 true positives vs 259 false positives (still approximately 1:1)
Max aggregation is more lenient, improving recall at no precision cost
Better suited when reference images capture different logo variants
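The difference between the two aggregations can be seen on a small example. The sketch below is an assumption-laden illustration (the helper `aggregate_scores` and the scores are hypothetical, and the min matching refs requirement is left out for brevity): one reference matches a logo variant strongly while the others do not.

```python
from statistics import mean


def aggregate_scores(ref_scores, how="max"):
    """Collapse per-reference similarities into one score per logo.

    ref_scores: list of (logo, similarity), one entry per reference image.
    how: "max" keeps the best reference, "mean" averages all references.
    """
    by_logo = {}
    for logo, score in ref_scores:
        by_logo.setdefault(logo, []).append(score)
    reducer = max if how == "max" else mean
    return {logo: reducer(scores) for logo, scores in by_logo.items()}


ref_scores = [
    ("logo_a", 0.88), ("logo_a", 0.55), ("logo_a", 0.60),  # one variant matches well
    ("logo_b", 0.66), ("logo_b", 0.64),
]
threshold = 0.70

accepted = {}
for how in ("mean", "max"):
    scores = aggregate_scores(ref_scores, how)
    accepted[how] = [logo for logo, s in scores.items() if s >= threshold]
    print(how, accepted[how])
```

Here the mean for logo_a falls to about 0.68 and misses the threshold, while the max (0.88) passes, which matches the higher recall observed for max aggregation.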

Key Findings

1. CLIP Embedding Similarity Distribution

The simple matching results reveal a fundamental issue: at threshold 0.70, the CLIP embedding space does not provide sufficient separation between different logos. The 78:1 false positive to true positive ratio indicates that:

Many unrelated images produce high cosine similarity scores
The threshold would need to be significantly higher (0.85+) to reduce false positives
Even then, recall would likely suffer

2. Margin Method Limitation with Multiple References

The margin-based matching method was designed assuming one reference per logo. When using multiple references (10 per logo in this test), references from the same logo compete against each other in the margin calculation. This causes legitimate matches to be rejected when two references from the same logo have similar scores.
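One way to avoid same-logo competition is to reduce the references to a single best score per logo before applying the margin, so that the runner-up is always a different logo. The sketch below illustrates that idea only; it is an assumption about how a logo-level margin could work, not a description of the repository's multi-ref code, and `logo_level_margin_match` is a hypothetical name.

```python
def logo_level_margin_match(ref_scores, threshold=0.70, margin=0.05):
    """Apply the margin between logos instead of between references.

    ref_scores: list of (logo, similarity), one entry per reference image.
    Each logo is first reduced to its best reference score.
    Returns the matched logo name, or None if the match is rejected.
    """
    best = {}
    for logo, score in ref_scores:
        best[logo] = max(score, best.get(logo, float("-inf")))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < threshold:
        return None  # best logo does not clear the threshold
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None  # a different logo is too close: ambiguous
    return ranked[0][0]


same_logo_close = [("logo_a", 0.85), ("logo_a", 0.84), ("logo_b", 0.62)]
other_logo_close = [("logo_a", 0.85), ("logo_b", 0.83)]

print(logo_level_margin_match(same_logo_close))   # logo_a
print(logo_level_margin_match(other_logo_close))  # None
```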

3. False Positive Rate Remains High

Even the best-performing method (multi-ref max) produces nearly as many false positives as true positives:

278 correct matches
259 incorrect matches
This 1:1 ratio is problematic for production use cases

4. Trade-off Between Precision and Recall

Goal	Best Method	Trade-off
Maximize precision	Margin	Very low recall (16.3%)
Maximize recall	Multi-ref (max)	Lower precision (51.8%)
Balance both	Multi-ref (max)	Best F1 but still ~50% precision

Deficiencies of This Approach

CLIP Model Limitations

General-Purpose Training: CLIP was trained on text-image pairs for general visual understanding, not for fine-grained logo discrimination. Logo matching requires distinguishing between visually similar brand marks, which CLIP's training objective doesn't optimize for.
Embedding Space Density: The cosine similarity scores cluster in a narrow range (0.6-0.9 for most images), making threshold-based discrimination difficult. Small differences in embedding similarity don't reliably indicate visual differences.
Scale and Context Sensitivity: CLIP embeddings are affected by the context around detected regions. A logo on a busy background may produce different embeddings than the same logo on a clean background.
No Logo-Specific Features: CLIP doesn't learn features specific to logo recognition such as:
- Typography and font shapes
- Brand-specific color combinations
- Geometric patterns and symmetry
- Edge and contour characteristics

Detection Pipeline Issues

DETR Detection Quality: The pipeline assumes DETR correctly identifies logo regions. Detection errors (missed logos, partial detections, non-logo regions) propagate to the matching stage.
Cropping Artifacts: Detected regions are cropped and resized before embedding extraction. This may introduce artifacts that affect embedding quality.
Threshold Sensitivity: The entire system is highly sensitive to the similarity threshold parameter. A 0.05 change in threshold can dramatically alter precision/recall balance.

Test Run: Threshold Optimization Tests

Date: 2026-01-02
Embedding Model: openai/clip-vit-large-patch14
Matching Method: Multi-ref (max) for all tests

Test Configuration

Parameter	Value
Reference logos	20
Refs per logo	10
Total reference embeddings	189
Positive samples per logo	20
Negative samples per logo	100
Test images processed	~2,355
DETR threshold	0.50
Min matching refs	3
Random seed	42

Results Summary

Test	Threshold	Margin	TP	FP	FN	Precision	Recall	F1
1 (baseline)	0.70	0.05	265	288	141	47.9%	71.8%	57.5%
2	0.80	0.05	233	472	165	33.0%	63.1%	43.4%
3	0.80	0.10	187	375	208	33.3%	50.7%	40.2%
4	0.85	0.10	160	434	223	26.9%	43.4%	33.2%
5	0.85	0.15	163	410	220	28.4%	44.2%	34.6%
6	0.90	0.15	84	69	288	54.9%	22.8%	32.2%

Analysis

Counter-Intuitive Results

The most striking finding is that raising the similarity threshold made performance worse in most cases:

Threshold Change	Effect on FP:TP Ratio
0.70 → 0.80	1.09:1 → 2.03:1 (worse)
0.80 → 0.85	2.03:1 → 2.71:1 (worse)
0.85 → 0.90	2.71:1 → 0.82:1 (better)

This is the opposite of expected behavior. Normally, raising the threshold should reduce false positives. Instead, false positives increased from 288 at threshold 0.70 to 472 at threshold 0.80.
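The ratios can be recomputed for all six tests from the TP and FP counts. In this sketch the helper `fp_tp_ratio` is a hypothetical name; the figures in the ratio table correspond to tests 1, 2, 4, and 6, so the margin also differs between some of the rows being compared.

```python
# (threshold, margin, TP, FP) copied from the results summary table
tests = {
    1: (0.70, 0.05, 265, 288),
    2: (0.80, 0.05, 233, 472),
    3: (0.80, 0.10, 187, 375),
    4: (0.85, 0.10, 160, 434),
    5: (0.85, 0.15, 163, 410),
    6: (0.90, 0.15, 84, 69),
}


def fp_tp_ratio(tp: int, fp: int) -> float:
    """False positives per true positive, rounded to two decimals."""
    return round(fp / tp, 2)


ratios = {test_id: fp_tp_ratio(tp, fp) for test_id, (_, _, tp, fp) in tests.items()}

for test_id, (threshold, margin, _, _) in tests.items():
    print(f"test {test_id}: threshold={threshold:.2f} margin={margin:.2f} "
          f"FP:TP={ratios[test_id]:.2f}:1")
```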

Why Higher Thresholds Failed

The likely explanation relates to how min_matching_refs interacts with the threshold:

True positives are penalized more: Correct matches require 3+ references to exceed the threshold. At higher thresholds, fewer references clear the bar, causing legitimate matches to fail the min_matching_refs=3 requirement.
False positives survive differently: False positive detections may have 1-2 references that happen to score very high (above the threshold) due to random visual similarities. Since we use max aggregation, these spurious high scores still produce matches.
The margin becomes less effective: When most scores are clustered below the threshold, the margin check operates on a smaller pool of candidates, reducing its discriminative power.
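The first point can be illustrated with a small count check. This is a sketch under stated assumptions: `passes_ref_gate` is a hypothetical helper, the per-reference scores are invented for the example, and the gate is assumed to count references scoring at or above the threshold.

```python
def passes_ref_gate(scores, threshold, min_matching_refs=3):
    """True if at least `min_matching_refs` references reach the threshold."""
    return sum(score >= threshold for score in scores) >= min_matching_refs


# Hypothetical similarities of one correct detection against its logo's references.
true_logo_scores = [0.86, 0.83, 0.81, 0.78, 0.74]

thresholds = (0.70, 0.80, 0.85, 0.90)
outcomes = [passes_ref_gate(true_logo_scores, t) for t in thresholds]

for threshold, ok in zip(thresholds, outcomes):
    matching = sum(score >= threshold for score in true_logo_scores)
    print(f"threshold {threshold:.2f}: {matching} refs match, accepted={ok}")
```

With these scores the detection passes at 0.70 and 0.80 but is dropped at 0.85, even though its best reference still scores 0.86.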

Threshold 0.90: Different Behavior

At threshold 0.90, behavior finally matches expectations:

False positives dropped dramatically (69 vs 288-472 in other tests)
But recall collapsed to 22.8%
Only 84 true positives out of 369 expected

This suggests that at 0.90, the threshold is finally high enough to filter out most noise, but it's too aggressive and rejects most legitimate matches as well.

The Optimal Threshold Problem

Threshold	Precision	Recall	F1	Assessment
0.70	47.9%	71.8%	57.5%	Best overall F1
0.80	33.0%	63.1%	43.4%	Worse than baseline
0.85	26.9-28.4%	43-44%	33-35%	Much worse
0.90	54.9%	22.8%	32.2%	Best precision, worst recall

The lowest threshold tested (0.70) produced the best F1 score. This indicates:

CLIP embeddings don't provide clean separation at any threshold
The multi-ref matching with min_matching_refs provides better discrimination than threshold alone
Raising the threshold hurts true positives more than it helps reject false positives

Margin Parameter Impact

Comparing tests with the same threshold but different margins:

Threshold	Margin 0.05	Margin 0.10	Margin 0.15
0.80	F1: 43.4%	F1: 40.2%	-
0.85	-	F1: 33.2%	F1: 34.6%

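Increasing the margin had a smaller and less consistent effect than changing the threshold. At threshold 0.80, raising the margin from 0.05 to 0.10 removed both true positives (233 → 187) and false positives (472 → 375), lowering F1 from 43.4% to 40.2%. At threshold 0.85, raising it from 0.10 to 0.15 left true positives nearly unchanged (160 → 163) and trimmed false positives (434 → 410), raising F1 from 33.2% to 34.6%. The margin parameter is less impactful than the threshold in this configuration.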

Key Findings

The baseline (threshold=0.70, margin=0.05) was optimal: No threshold/margin combination tested outperformed the defaults for F1 score.
Threshold tuning alone cannot fix CLIP's limitations: The embedding space doesn't provide clear separation points that can be exploited with threshold adjustments.
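min_matching_refs appears to matter more than threshold: The requirement for multiple matching references seems to provide better discrimination than the similarity threshold, although min_matching_refs was held at 3 in every test and was not varied directly.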
Precision-recall trade-off is extreme: Achieving 55% precision (at threshold 0.90) requires accepting only 23% recall.
The 0.80-0.85 range is a "dead zone": Thresholds in this range produce lower precision (26.9-33.3%) than either extreme (47.9% at 0.70, 54.9% at 0.90) and lower F1 than the 0.70 baseline.

Implications

These results suggest that improving logo detection accuracy requires:

A different embedding model with better logo discrimination
Logo-specific fine-tuning
Alternative matching strategies beyond threshold-based approaches
Potentially ensemble methods combining multiple signals

Simply tuning threshold and margin parameters with CLIP is insufficient to achieve acceptable precision/recall balance.

Test Run: [Next Test Name]

Results pending...
