Document margin behavior and update model comparison script

- Add section explaining how margin works differently in multi-ref vs margin-only matching, with examples showing why margin-only fails when using multiple references per logo
- Update run_model_comparison.sh to use optimal threshold (0.70) and margin (0.05) based on test results
- Add DINOv2 Large model test to comparison script
- Add threshold optimization test analysis to results document
@@ -224,6 +224,51 @@ This ensures confident matches and reduces ambiguous classifications.
- Margin required: 0.05
- Result: **No match** (0.82 - 0.79 = 0.03 < 0.05)

#### Margin in Multi-Ref vs Margin-Only Matching
The margin parameter applies to both `margin` and `multi-ref` methods, but operates at different levels:

| Method | What Margin Compares |
|--------|---------------------|
| `margin` | Best **reference embedding** vs second-best **reference embedding** |
| `multi-ref` | Best **logo's aggregated score** vs second-best **logo's aggregated score** |

This distinction is critical when using multiple references per logo.
#### The Problem with Margin-Only and Multiple References

In margin-only matching, all individual reference embeddings compete against each other—including references from the **same logo**. This causes legitimate matches to be rejected.

**Example showing the problem:**

Suppose Nike has 3 references and Adidas has 3 references. A detected region produces:
| Reference | Similarity |
|-----------|------------|
| Nike_ref1 | 0.92 |
| Nike_ref2 | 0.91 |
| Nike_ref3 | 0.85 |
| Adidas_ref1 | 0.78 |
| Adidas_ref2 | 0.75 |
| Adidas_ref3 | 0.72 |
**With margin-only matching (margin=0.05):**

- Best reference: Nike_ref1 (0.92)
- Second-best reference: Nike_ref2 (0.91) ← Same logo!
- Margin check: 0.92 - 0.91 = 0.01 < 0.05 → **Rejected**

The match is rejected even though this is clearly a Nike logo! Nike's own references compete against each other and fail the margin test.
**With multi-ref matching (margin=0.05):**

- First, aggregate scores per logo:
  - Nike: max(0.92, 0.91, 0.85) = 0.92
  - Adidas: max(0.78, 0.75, 0.72) = 0.78
- Best logo: Nike (0.92)
- Second-best logo: Adidas (0.78)
- Margin check: 0.92 - 0.78 = 0.14 >= 0.05 → **Accepted**

This is why margin-only matching produces very low recall when using multiple references per logo—it was designed for single-reference scenarios.
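To make the distinction concrete, here is a minimal, self-contained Python sketch of both checks, using the similarity scores from the example above. The function names and the `Logo_refN` naming convention are illustrative assumptions, not the project's actual API:

```python
# Illustrative sketch only -- not the project's actual matching API.
scores = {
    "Nike_ref1": 0.92, "Nike_ref2": 0.91, "Nike_ref3": 0.85,
    "Adidas_ref1": 0.78, "Adidas_ref2": 0.75, "Adidas_ref3": 0.72,
}

def margin_only(scores: dict[str, float], margin: float) -> bool:
    """Compare the two best individual references, regardless of logo."""
    top, second = sorted(scores.values(), reverse=True)[:2]
    return top - second >= margin  # 0.92 - 0.91 = 0.01 -> rejected

def multi_ref(scores: dict[str, float], margin: float) -> bool:
    """Aggregate per logo with max(), then compare the two best logos."""
    per_logo: dict[str, float] = {}
    for ref, sim in scores.items():
        logo = ref.split("_")[0]  # "Nike_ref1" -> "Nike"
        per_logo[logo] = max(per_logo.get(logo, 0.0), sim)
    top, second = sorted(per_logo.values(), reverse=True)[:2]
    return top - second >= margin  # 0.92 - 0.78 = 0.14 -> accepted

print(margin_only(scores, 0.05))  # False: Nike's own refs compete
print(multi_ref(scores, 0.05))    # True: comparison is Nike vs Adidas
```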
---

### 6. Embedding Caching
@@ -13,8 +13,8 @@ REFS_PER_LOGO=10
POSITIVE_SAMPLES=20
NEGATIVE_SAMPLES=100
MIN_MATCHING_REFS=3
-THRESHOLD=0.80
-MARGIN=0.10
+THRESHOLD=0.70
+MARGIN=0.05
SEED=42

# Clear output file and write header
@@ -82,11 +82,29 @@ uv run python "$SCRIPT_DIR/test_logo_detection.py" \
  --clear-cache \
  --output-file "$OUTPUT_FILE"

echo ""

# Test 3: DINOv2 Large
echo "=== Test 3: DINOv2 Large (facebook/dinov2-large) ==="
uv run python "$SCRIPT_DIR/test_logo_detection.py" \
  --num-logos $NUM_LOGOS \
  --refs-per-logo $REFS_PER_LOGO \
  --positive-samples $POSITIVE_SAMPLES \
  --negative-samples $NEGATIVE_SAMPLES \
  --matching-method multi-ref \
  --min-matching-refs $MIN_MATCHING_REFS \
  --use-max-similarity \
  --threshold $THRESHOLD \
  --margin $MARGIN \
  --seed $SEED \
  --embedding-model "facebook/dinov2-large" \
  --clear-cache \
  --output-file "$OUTPUT_FILE"

echo ""
echo "Results saved to: $OUTPUT_FILE"
echo ""
echo "Note: You can also try other models:"
echo " - facebook/dinov2-base"
echo " - facebook/dinov2-large"
echo " - openai/clip-vit-base-patch32"
echo " - openai/clip-vit-large-patch14-336"
@@ -137,6 +137,119 @@ Even the best-performing method (multi-ref max) produces nearly as many false po
---

## Test Run: Threshold Optimization Tests

**Date**: 2026-01-02
**Embedding Model**: openai/clip-vit-large-patch14
**Matching Method**: Multi-ref (max) for all tests

### Test Configuration
| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| DETR threshold | 0.50 |
| Min matching refs | 3 |
| Random seed | 42 |

### Results Summary
| Test | Threshold | Margin | TP | FP | FN | Precision | Recall | F1 |
|------|----------:|-------:|---:|---:|---:|----------:|-------:|---:|
| 1 (baseline) | 0.70 | 0.05 | 265 | 288 | 141 | 47.9% | 71.8% | 57.5% |
| 2 | 0.80 | 0.05 | 233 | 472 | 165 | 33.0% | 63.1% | 43.4% |
| 3 | 0.80 | 0.10 | 187 | 375 | 208 | 33.3% | 50.7% | 40.2% |
| 4 | 0.85 | 0.10 | 160 | 434 | 223 | 26.9% | 43.4% | 33.2% |
| 5 | 0.85 | 0.15 | 163 | 410 | 220 | 28.4% | 44.2% | 34.6% |
| 6 | 0.90 | 0.15 | 84 | 69 | 288 | 54.9% | 22.8% | 32.2% |
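As a sanity check, the derived columns can be recomputed from the raw counts. The sketch below assumes recall is measured against the 369 expected positives cited in the analysis that follows, since TP/(TP+FN) does not reproduce the reported values:

```python
# Sanity-check sketch: reproduce Test 1's derived metrics from its raw counts.
# Assumption: recall uses the 369 expected positives cited below, not TP+FN.
EXPECTED_POSITIVES = 369

def metrics(tp: int, fp: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / EXPECTED_POSITIVES
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = metrics(tp=265, fp=288)  # Test 1 (baseline)
print(f"P={p:.1%}  R={r:.1%}  F1={f1:.1%}")  # P=47.9%  R=71.8%  F1=57.5%
```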
### Analysis

#### Counter-Intuitive Results

The most striking finding is that **raising the similarity threshold made performance worse** in most cases:
| Threshold Change | Effect on FP:TP Ratio |
|------------------|----------------------|
| 0.70 → 0.80 | 1.09:1 → 2.03:1 (worse) |
| 0.80 → 0.85 | 2.03:1 → 2.71:1 (worse) |
| 0.85 → 0.90 | 2.71:1 → 0.82:1 (better) |
This is the opposite of expected behavior. Normally, raising the threshold should reduce false positives. Instead, false positives *increased* from 288 at threshold 0.70 to 472 at threshold 0.80.
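The ratios above follow directly from the TP and FP columns in the Results Summary (Tests 1, 2, 4, and 6):

```python
# FP:TP ratios recomputed from the Results Summary (Test 4 is the 0.85 row).
for thr, tp, fp in [(0.70, 265, 288), (0.80, 233, 472), (0.85, 160, 434), (0.90, 84, 69)]:
    print(f"{thr:.2f} -> {fp / tp:.2f}:1")
# 0.70 -> 1.09:1, 0.80 -> 2.03:1, 0.85 -> 2.71:1, 0.90 -> 0.82:1
```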
#### Why Higher Thresholds Failed

The likely explanation relates to how `min_matching_refs` interacts with the threshold:
1. **True positives are penalized more**: Correct matches require 3+ references to exceed the threshold. At higher thresholds, fewer references clear the bar, causing legitimate matches to fail the `min_matching_refs=3` requirement.

2. **False positives survive differently**: False positive detections may have 1-2 references that happen to score very high (above the threshold) due to random visual similarities. Since we use max aggregation, these spurious high scores still produce matches (see the sketch after this list).

3. **The margin becomes less effective**: When most scores are clustered below the threshold, the margin check operates on a smaller pool of candidates, reducing its discriminative power.
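The following toy sketch illustrates points 1 and 2 with invented scores; the decision rule is a simplification of the actual matching logic:

```python
# Toy illustration with invented scores: raising the threshold can eliminate
# a legitimate match (broad, moderate agreement across refs) while a false
# positive with a few spuriously high refs keeps satisfying the rule.
MIN_MATCHING_REFS = 3

def is_match(ref_scores: list[float], threshold: float) -> bool:
    # Simplified rule: enough references must clear the threshold.
    hits = sum(1 for s in ref_scores if s >= threshold)
    return hits >= MIN_MATCHING_REFS

true_positive = [0.78, 0.76, 0.74, 0.71, 0.65]   # genuine logo, moderate scores
false_positive = [0.86, 0.83, 0.82, 0.55, 0.50]  # coincidental high similarities

for thr in (0.70, 0.80):
    print(thr, is_match(true_positive, thr), is_match(false_positive, thr))
# 0.7 True True   -> both pass at the baseline threshold
# 0.8 False True  -> only the legitimate match loses its 3 qualifying refs
```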
#### Threshold 0.90: Different Behavior

At threshold 0.90, behavior finally matches expectations:

- False positives dropped dramatically (69 vs 288-472 in other tests)
- But recall collapsed to 22.8%
- Only 84 true positives out of 369 expected
This suggests that at 0.90, the threshold is finally high enough to filter out most noise, but it's too aggressive and rejects most legitimate matches as well.

#### The Optimal Threshold Problem
| Threshold | Precision | Recall | F1 | Assessment |
|-----------|-----------|--------|-----|------------|
| 0.70 | 47.9% | 71.8% | **57.5%** | Best overall F1 |
| 0.80 | 33.0% | 63.1% | 43.4% | Worse than baseline |
| 0.85 | 26.9-28.4% | 43-44% | 33-35% | Much worse |
| 0.90 | 54.9% | 22.8% | 32.2% | Best precision, worst recall |
The lowest threshold tested (0.70) produced the best F1 score. This indicates:

- CLIP embeddings don't provide clean separation at any threshold
- Multi-ref matching with `min_matching_refs` provides better discrimination than the threshold alone
- Raising the threshold hurts true positives more than it helps reject false positives
#### Margin Parameter Impact

Comparing tests with the same threshold but different margins:
| Threshold | Margin 0.05 | Margin 0.10 | Margin 0.15 |
|-----------|-------------|-------------|-------------|
| 0.80 | F1: 43.4% | F1: 40.2% | - |
| 0.85 | - | F1: 33.2% | F1: 34.6% |
Increasing the margin had minimal effect, slightly reducing both true and false positives. The margin parameter is less impactful than the threshold in this configuration.

### Key Findings
1. **The baseline (threshold=0.70, margin=0.05) was optimal**: No threshold/margin combination tested outperformed the defaults for F1 score.

2. **Threshold tuning alone cannot fix CLIP's limitations**: The embedding space doesn't provide clear separation points that can be exploited with threshold adjustments.

3. **min_matching_refs matters more than threshold**: The requirement for multiple matching references provides better discrimination than the similarity threshold.

4. **Precision-recall trade-off is extreme**: Achieving 55% precision (at threshold 0.90) requires accepting only 23% recall.

5. **The 0.80-0.85 range is a "dead zone"**: Thresholds in this range give up the baseline's recall and F1 without gaining the precision that 0.90 delivers.
### Implications

These results suggest that improving logo detection accuracy requires:

- A different embedding model with better logo discrimination
- Logo-specific fine-tuning
- Alternative matching strategies beyond threshold-based approaches
- Potentially ensemble methods combining multiple signals

Simply tuning threshold and margin parameters with CLIP is insufficient to achieve acceptable precision/recall balance.
---

## Test Run: [Next Test Name]

*Results pending...*