# Logo Detection Test Methodology

This document describes how the logo detection test framework works and the techniques it implements to improve detection accuracy.

## Overview

The system uses a two-stage pipeline:
1. **DETR** (DEtection TRansformer) - Detects candidate logo regions in images
2. **CLIP** (Contrastive Language-Image Pre-training) - Extracts feature embeddings used for matching

## Test Framework (`test_logo_detection.py`)

### Test Flow

1. **Sample Reference Logos**: Randomly select N logos from the database, with multiple reference images per logo
2. **Compute Reference Embeddings**: Generate CLIP embeddings for all reference logo images
3. **Build Test Set**: For each sampled logo, select:
   - Positive samples: Images known to contain the logo
   - Negative samples: Images known NOT to contain the logo
4. **Run Detection**: Process each test image through DETR to find logo regions
5. **Match Against References**: Compare detected regions against reference embeddings using the selected matching method (`margin` by default)
6. **Calculate Metrics**: Compute precision, recall, and F1 score

### Configurable Parameters

#### General Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--num-logos` | 10 | Number of reference logos to sample |
| `--refs-per-logo` | 3 | Reference images per logo |
| `--positive-samples` | 5 | Positive test images per logo |
| `--negative-samples` | 20 | Negative test images per logo |
| `--threshold` | 0.7 | CLIP similarity threshold for matching |
| `--detr-threshold` | 0.5 | DETR detection confidence threshold |
| `--seed` | None | Random seed for reproducibility |

#### Matching Method Selection

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--matching-method` | margin | Matching method: `simple`, `margin`, or `multi-ref` |
| `--margin` | 0.05 | Required margin between best and second-best match (applies to `margin` and `multi-ref`) |

#### Multi-Ref Method Parameters (when `--matching-method multi-ref`)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--min-matching-refs` | 1 | Minimum references that must match above threshold |
| `--use-max-similarity` | False | Use max similarity instead of mean across references |

#### Cache Control

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--no-cache` | False | Disable embedding cache |
| `--clear-cache` | False | Clear cache before running |

### Metrics

- **True Positives**: Detected logo correctly matches the expected logo
- **False Positives**: Detected logo matches the wrong logo, or a match is reported for an image that does not contain the logo
- **False Negatives**: Expected logo not detected or not matched
- **Precision**: TP / (TP + FP) - How many detections were correct
- **Recall**: TP / Total Expected - How many of the expected logos were found
- **F1 Score**: Harmonic mean of precision and recall, 2PR / (P + R)

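
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the test script's own code: the function name `detection_metrics` is hypothetical, and it assumes that "Total Expected" equals TP + FN.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from raw counts.

    Assumption: total expected logos = tp + fn. The real test script
    may count expected logos differently.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # Harmonic mean; defined as 0 when both precision and recall are 0
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only, not results from a real run
metrics = detection_metrics(tp=40, fp=10, fn=10)
print(metrics)
```
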

---

## Accuracy Improvement Techniques

### 1. Non-Maximum Suppression (NMS)

**Location**: `logo_detection_detr.py:214-268`

**Problem**: DETR may produce multiple overlapping bounding boxes for the same logo.

**Solution**: NMS removes redundant detections by:
1. Sorting detections by confidence score (descending)
2. Keeping the highest-scoring box
3. Removing any remaining boxes whose IoU with the kept box exceeds the threshold (default 0.5)
4. Repeating steps 2-3 on the remaining boxes until none are left

```
IoU (Intersection over Union) = Area of Overlap / Area of Union
```

**Configuration**: `nms_iou_threshold` parameter (default: 0.5)

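
The four NMS steps can be sketched as a greedy loop. This is an illustrative version in plain Python, not the code at `logo_detection_detr.py:214-268`. The `(x1, y1, x2, y2)` box format, the `(box, score)` pairs, and the helper names `iou` and `nms` are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs."""
    # Step 1: sort by confidence, highest first
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        # Step 2: keep the highest-scoring box
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: drop boxes that overlap the kept box too much
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
        # Step 4: loop until no candidates remain
    return kept


dets = [
    ((10, 10, 110, 110), 0.9),    # strongest detection
    ((15, 15, 115, 115), 0.8),    # heavy overlap with the first box
    ((200, 200, 260, 260), 0.7),  # separate region
]
kept = nms(dets)
```
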

---

### 2. Minimum Box Size Filtering

**Location**: `logo_detection_detr.py:187-191`

**Problem**: Very small detections are often noise or partial logo fragments.

**Solution**: Filter out detections where width OR height is below a minimum threshold.

**Configuration**: `min_box_size` parameter (default: 20 pixels)

---

### 3. Confidence Threshold Filtering

**Location**: `logo_detection_detr.py:177-179`

**Problem**: Low-confidence DETR detections are unreliable.

**Solution**: Only keep detections with confidence score >= threshold.

**Configuration**: `detr_threshold` parameter (default: 0.5)

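
The size and confidence filters from sections 2 and 3 amount to two simple checks per detection. The sketch below shows them together. The dictionary layout (`box`, `score`) and the function name `filter_detections` are assumptions for illustration, not the structures used in `logo_detection_detr.py`.

```python
def filter_detections(detections, detr_threshold=0.5, min_box_size=20):
    """Keep detections that pass both the confidence and the size filter.

    Each detection is assumed to be a dict with 'box' as (x1, y1, x2, y2)
    in pixels and 'score' as the DETR confidence.
    """
    kept = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        width, height = x2 - x1, y2 - y1
        if det["score"] < detr_threshold:
            continue  # confidence filter: keep only score >= threshold
        if width < min_box_size or height < min_box_size:
            continue  # size filter: reject if width OR height is too small
        kept.append(det)
    return kept


candidates = [
    {"box": (0, 0, 100, 80), "score": 0.91},  # passes both filters
    {"box": (0, 0, 100, 80), "score": 0.30},  # confidence too low
    {"box": (0, 0, 100, 12), "score": 0.95},  # height below 20 pixels
]
filtered = filter_detections(candidates)
```
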

---

### 4. Multiple Reference Images Per Logo

**Location**: `logo_detection_detr.py:397-457` (`find_best_match_multi_ref`)

**Problem**: A single reference image may not capture all variations of a logo (different angles, lighting, scales).

**Solution**: Use multiple reference images per logo and aggregate their similarity scores:
- Calculate similarity to each reference embedding
- Count how many references match above threshold
- Use mean or max similarity as the aggregate score
- Require a minimum number of references to match

**Configuration**:
- `refs_per_logo`: Number of reference images (default: 3)
- `min_matching_refs`: Minimum references that must match (default: 1)
- `use_max_similarity`: Use max instead of mean aggregation (default: False)

#### Mean vs Max Similarity Aggregation

When comparing a detected region against multiple reference images for the same logo, we need to combine the individual similarity scores into a single aggregate score. The two options are:

**Mean Similarity** (default, `--use-max-similarity` NOT set):
- Calculates the average similarity across ALL reference images
- More conservative: requires consistent matching across references
- Better at rejecting false positives where only one reference happens to match

**Max Similarity** (`--use-max-similarity` flag):
- Takes the HIGHEST similarity score from any single reference
- More lenient: only needs one good match to succeed
- Better recall when logos have high variability (one reference might be a perfect match)

#### Detailed Example

Suppose we have 5 reference images for the Nike logo, and a detected region produces these similarity scores:

| Reference | Similarity |
|-----------|------------|
| nike_ref1.png | 0.92 |
| nike_ref2.png | 0.78 |
| nike_ref3.png | 0.85 |
| nike_ref4.png | 0.71 |
| nike_ref5.png | 0.88 |

**With Mean Aggregation:**
```
Score = (0.92 + 0.78 + 0.85 + 0.71 + 0.88) / 5 = 0.828
```
The score reflects the overall consistency of the match. If one reference is an outlier (like nike_ref4 at 0.71), it pulls the average down.

**With Max Aggregation:**
```
Score = max(0.92, 0.78, 0.85, 0.71, 0.88) = 0.92
```
The score reflects the best possible match. The lower-scoring references don't affect the result.

#### When to Use Each

| Scenario | Recommended | Why |
|----------|-------------|-----|
| Logos with consistent appearance | Mean | Penalizes partial matches that only hit one variant |
| Logos with high variability (different colors, orientations) | Max | One reference matching well is sufficient evidence |
| High false positive rate | Mean | More conservative scoring reduces false matches |
| High false negative rate | Max | More lenient scoring catches more true matches |
| Reference images are all similar | Either | Results will be similar |
| Reference images show different logo variants | Max | Each variant should be allowed to match independently |

#### Combined Example with min_matching_refs

The `min_matching_refs` parameter works independently of the aggregation method. It counts how many references exceed the threshold, regardless of which aggregation is used for the final score.

**Example with threshold=0.80, min_matching_refs=2:**

| Reference | Similarity | Above Threshold? |
|-----------|------------|------------------|
| nike_ref1.png | 0.92 | Yes |
| nike_ref2.png | 0.78 | No |
| nike_ref3.png | 0.85 | Yes |
| nike_ref4.png | 0.71 | No |
| nike_ref5.png | 0.88 | Yes |

- References above threshold: 3 (nike_ref1, nike_ref3, nike_ref5)
- min_matching_refs requirement: 2 ✓ (3 >= 2, so we proceed)
- Mean score: 0.828
- Max score: 0.92

If only one reference were above the threshold, the match would be rejected regardless of the aggregate score.

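
The interaction between `min_matching_refs` and the aggregation mode can be reproduced with the Nike scores from the table. The sketch below is illustrative only. The function `aggregate_logo_score` is hypothetical, it treats a score equal to the threshold as a match, and it does not apply any further check on the aggregate score, which the real `find_best_match_multi_ref` may do.

```python
from statistics import mean


def aggregate_logo_score(similarities, threshold=0.80,
                         min_matching_refs=2, use_max=False):
    """Return the aggregate score, or None when too few references match."""
    # Count references at or above the threshold (the >= is an assumption)
    matching = sum(1 for s in similarities if s >= threshold)
    if matching < min_matching_refs:
        return None  # rejected before any aggregation happens
    return max(similarities) if use_max else mean(similarities)


nike = [0.92, 0.78, 0.85, 0.71, 0.88]

mean_score = aggregate_logo_score(nike)               # mean aggregation
max_score = aggregate_logo_score(nike, use_max=True)  # max aggregation
rejected = aggregate_logo_score(nike, min_matching_refs=4)  # only 3 refs match
```

Raising `min_matching_refs` to 4 rejects the region even though both aggregate scores are unchanged, which is the independence described above.
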

---

### 5. Margin-Based Matching

**Location**: `logo_detection_detr.py:459-505` (`find_best_match_with_margin`)

**Problem**: When multiple logos have similar embeddings, the best match may not be significantly better than the alternatives, leading to false positives.

**Solution**: Require the best match to exceed the second-best match by a minimum margin:

```
Match only if: best_similarity - second_best_similarity >= margin
```

This rejects ambiguous classifications, where two candidates score almost equally, and accepts only matches that clearly stand out.

**Configuration**: `--margin` parameter (default: 0.05)

**Example**:
- Best match: Logo A with similarity 0.82
- Second best: Logo B with similarity 0.79
- Margin required: 0.05
- Result: **No match** (0.82 - 0.79 = 0.03 < 0.05)

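
The margin rule can be sketched as follows. This is a simplified stand-in for `find_best_match_with_margin`, not its actual code. The function name `match_with_margin`, the dictionary input, and the choice to accept a single candidate without a margin check are assumptions made for the example.

```python
def match_with_margin(scores, threshold=0.7, margin=0.05):
    """Return the best label if it clears the threshold and the margin.

    `scores` maps a candidate label to its similarity.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_label, best = ranked[0]
    if best < threshold:
        return None  # best candidate is not similar enough
    if len(ranked) > 1:
        second = ranked[1][1]
        if best - second < margin:
            return None  # too close to the runner-up: ambiguous
    return best_label


ambiguous = match_with_margin({"Logo A": 0.82, "Logo B": 0.79})
confident = match_with_margin({"Logo A": 0.90, "Logo B": 0.79})
```

The first call reproduces the example above (a gap of 0.03 fails the 0.05 margin); the second passes because the gap is 0.11.
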
#### Margin in Multi-Ref vs Margin-Only Matching

The margin parameter applies to both `margin` and `multi-ref` methods, but operates at different levels:

| Method | What Margin Compares |
|--------|---------------------|
| `margin` | Best **reference embedding** vs second-best **reference embedding** |
| `multi-ref` | Best **logo's aggregated score** vs second-best **logo's aggregated score** |

This distinction is critical when using multiple references per logo.

#### The Problem with Margin-Only and Multiple References

In margin-only matching, all individual reference embeddings compete against each other, including references from the **same logo**. This causes legitimate matches to be rejected.

**Example showing the problem:**

Suppose Nike has 3 references and Adidas has 3 references. A detected region produces:

| Reference | Similarity |
|-----------|------------|
| Nike_ref1 | 0.92 |
| Nike_ref2 | 0.91 |
| Nike_ref3 | 0.85 |
| Adidas_ref1 | 0.78 |
| Adidas_ref2 | 0.75 |
| Adidas_ref3 | 0.72 |

**With margin-only matching (margin=0.05):**
- Best reference: Nike_ref1 (0.92)
- Second-best reference: Nike_ref2 (0.91) ← Same logo!
- Margin check: 0.92 - 0.91 = 0.01 < 0.05 → **Rejected**

The match is rejected even though this is clearly a Nike logo. Nike's own references compete against each other and fail the margin test.

**With multi-ref matching (margin=0.05):**
- First, aggregate scores per logo (max aggregation shown):
  - Nike: max(0.92, 0.91, 0.85) = 0.92
  - Adidas: max(0.78, 0.75, 0.72) = 0.78
- Best logo: Nike (0.92)
- Second-best logo: Adidas (0.78)
- Margin check: 0.92 - 0.78 = 0.14 >= 0.05 → **Accepted**

The default mean aggregation gives the same outcome here: Nike 0.893 vs Adidas 0.75, a margin of 0.143.

This is why margin-only matching produces very low recall when using multiple references per logo: it was designed for single-reference scenarios.

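
The difference between the two margin checks is easy to reproduce on the Nike/Adidas scores. The sketch below is a simplified illustration: both helper functions are hypothetical, and they look only at the margin test, ignoring the threshold and `min_matching_refs`.

```python
refs = {
    "Nike": [0.92, 0.91, 0.85],
    "Adidas": [0.78, 0.75, 0.72],
}


def margin_over_references(refs, margin):
    """Margin-only style: every reference competes, even within one logo."""
    flat = sorted((s for sims in refs.values() for s in sims), reverse=True)
    return flat[0] - flat[1] >= margin


def margin_over_logos(refs, margin, aggregate=max):
    """Multi-ref style: aggregate per logo first, then compare logos."""
    per_logo = sorted((aggregate(sims) for sims in refs.values()), reverse=True)
    return per_logo[0] - per_logo[1] >= margin


margin_only_ok = margin_over_references(refs, margin=0.05)  # 0.92 vs 0.91
multi_ref_ok = margin_over_logos(refs, margin=0.05)         # 0.92 vs 0.78
```
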

---

### 6. Embedding Caching

**Location**: `test_logo_detection.py:49-82` (`EmbeddingCache` class)

**Problem**: Computing CLIP embeddings is computationally expensive. Without a cache, every test run recomputes embeddings for the same images.

**Solution**: Cache embeddings to disk using pickle:
- Reference embeddings keyed by `ref:{filename}`
- Detection results keyed by `det:{filename}`
- Cache persists between runs (`.embedding_cache.pkl`)

**Configuration**:
- `--no-cache`: Disable caching entirely
- `--clear-cache`: Clear cache before running

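
A pickle-backed cache with the key scheme described above might look like the sketch below. It is not the `EmbeddingCache` class from `test_logo_detection.py`: the class name `SimpleEmbeddingCache`, its methods, and the `fake_embed` stand-in for a CLIP call are all hypothetical. Because `pickle.load` can execute arbitrary code, a cache file like this should only be loaded when it comes from a trusted source.

```python
import os
import pickle
import tempfile


class SimpleEmbeddingCache:
    """Dictionary cache persisted to a pickle file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)  # trusted local file only

    def get_or_compute(self, key, compute):
        # Compute on a miss; reuse the stored value on a hit
        if key not in self.data:
            self.data[key] = compute()
        return self.data[key]

    def save(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f)


calls = []

def fake_embed():
    """Stand-in for an expensive CLIP embedding call."""
    calls.append(1)
    return [0.1, 0.2, 0.3]


cache_path = os.path.join(tempfile.mkdtemp(), ".embedding_cache.pkl")
cache = SimpleEmbeddingCache(cache_path)
cache.get_or_compute("ref:nike_ref1.png", fake_embed)  # miss: computes
cache.save()

reloaded = SimpleEmbeddingCache(cache_path)               # simulates a new run
reloaded.get_or_compute("ref:nike_ref1.png", fake_embed)  # hit: no compute
```
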

---

### 7. Normalized Embeddings for Cosine Similarity

**Location**: `logo_detection_detr.py:334-335`

**Problem**: Raw CLIP embeddings have varying magnitudes, which can affect similarity calculations.

**Solution**: L2-normalize all embeddings before comparison:

```python
features = F.normalize(features, dim=-1)  # F is torch.nn.functional
```

After normalization every embedding has unit length, so the dot product of two embeddings equals their cosine similarity and scores fall in the range [-1, 1].

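
The effect of L2 normalization can be checked without PyTorch. The sketch below uses plain Python lists in place of CLIP embeddings. The helper names are hypothetical and the two-dimensional vectors are toy values chosen only to make the arithmetic easy to follow.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([30.0, 40.0])   # same direction, 10x the magnitude
c = l2_normalize([-3.0, -4.0])   # opposite direction

same_direction = dot(a, b)  # close to 1: magnitude no longer matters
opposite = dot(a, c)        # close to -1: lower bound of the range
```
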

---

## Matching Methods Summary

| Method | Test Script Option | Key Feature |
|--------|-------------------|-------------|
| `find_all_matches` | `--matching-method simple` | Returns ALL logos above threshold (baseline, most permissive) |
| `find_best_match_with_margin` | `--matching-method margin` | Requires margin over second-best match |
| `find_best_match_multi_ref` | `--matching-method multi-ref` | Aggregates scores across reference images |

The test script supports `simple`, `margin`, and `multi-ref` matching methods via the `--matching-method` parameter.


---

## Detection Pipeline Summary

```
              Input Image
                   │
                   ▼
┌─────────────────────────────────────┐
│  DETR Object Detection              │
│  - Identifies potential logo regions│
│  - Returns bounding boxes + scores  │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Confidence Filtering               │
│  - Remove detections < threshold    │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Size Filtering                     │
│  - Remove boxes < min_box_size      │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  CLIP Embedding Extraction          │
│  - Crop each detected region        │
│  - Generate normalized embedding    │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Non-Maximum Suppression            │
│  - Remove overlapping detections    │
│  - Keep highest confidence boxes    │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Matching (selectable method)       │
│  ┌─────────┬─────────┬────────────┐ │
│  │ simple  │ margin  │ multi-ref  │ │
│  ├─────────┼─────────┼────────────┤ │
│  │ All     │ Require │ Aggregate  │ │
│  │ matches │ margin  │ across     │ │
│  │ above   │ over    │ refs       │ │
│  │ thresh  │ 2nd best│ (mean/max) │ │
│  └─────────┴─────────┴────────────┘ │
└─────────────────────────────────────┘
                   │
                   ▼
          Matched Logo Labels
```


---

## Tuning Recommendations

### For Simple Matching (`--matching-method simple`)

| Goal | Adjustments |
|------|-------------|
| **Reduce false positives** | Increase `--threshold` (only tuning option for simple method) |
| **Reduce false negatives** | Decrease `--threshold` |

Note: Simple matching is primarily used as a baseline. For production use, consider `margin` or `multi-ref`.

### For Margin-Based Matching (`--matching-method margin`)

| Goal | Adjustments |
|------|-------------|
| **Reduce false positives** | Increase `--threshold`, increase `--margin` |
| **Reduce false negatives** | Decrease `--threshold`, decrease `--margin` |

### For Multi-Ref Matching (`--matching-method multi-ref`)

| Goal | Adjustments |
|------|-------------|
| **Reduce false positives** | Increase `--threshold`, increase `--margin`, increase `--min-matching-refs`, use mean similarity |
| **Reduce false negatives** | Decrease `--threshold`, decrease `--margin`, decrease `--min-matching-refs`, use `--use-max-similarity` |

### General Tuning

| Goal | Adjustments |
|------|-------------|
| **Faster processing** | Decrease `--refs-per-logo`, use caching |
| **More robust detection** | Increase `--refs-per-logo`, decrease `--detr-threshold` |
| **Higher precision** | Increase `--detr-threshold`, use margin method with high margin |
| **Higher recall** | Decrease `--detr-threshold`, use multi-ref with low `--min-matching-refs` |


---

## Example Usage

```bash
# Simple matching (baseline - all matches above threshold)
python test_logo_detection.py -n 20 --matching-method simple --threshold 0.70

# Default margin-based matching
python test_logo_detection.py -n 20 --threshold 0.75 --margin 0.05

# Multi-ref matching with margin (recommended for reducing false positives)
python test_logo_detection.py -n 20 --matching-method multi-ref \
    --refs-per-logo 5 --min-matching-refs 2 --threshold 0.70 --margin 0.05

# Multi-ref matching with max similarity (more lenient)
python test_logo_detection.py -n 20 --matching-method multi-ref \
    --refs-per-logo 5 --min-matching-refs 1 --use-max-similarity --margin 0.03

# Reproducible test with seed
python test_logo_detection.py -n 50 --seed 42 --clear-cache
```