The from_pretrained method applied LoRA twice:
1. In the constructor, via the lora_r parameter
2. Again when loading weights with PeftModel.from_pretrained()
It now creates the base model with lora_r=0 and loads the LoRA weights
separately.
Note: the warning about "missing adapter keys" for layers 0-11 is
expected, since those layers are frozen and carry no LoRA adapters.
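A minimal sketch of the fixed loading path, assuming LogoFineTunedCLIP
lives in training/model.py and exposes its CLIP vision encoder as
vision_model (the attribute name and checkpoint path are illustrative,
not verified against the repo):

    from peft import PeftModel
    from training.model import LogoFineTunedCLIP

    checkpoint_dir = "checkpoints/best"  # hypothetical path to the saved adapter

    # Build the wrapper without injecting LoRA (lora_r=0 keeps base CLIP plain).
    model = LogoFineTunedCLIP(lora_r=0)

    # Apply the trained adapter exactly once while loading.
    model.vision_model = PeftModel.from_pretrained(model.vision_model, checkpoint_dir)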
Previously, the trainer saved a new "best" model if either separation
or loss improved, with loss checked as a fallback. As a result, a model
with lower separation could overwrite a better one.
Now only separation (the gap between positive and negative similarity)
determines the best model, since it is the key metric for contrastive
learning quality.
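A sketch of the new rule; the helper and tensor names (is_new_best,
pos_sim, neg_sim) are illustrative, not the trainer's actual
identifiers:

    import torch

    def is_new_best(pos_sim: torch.Tensor, neg_sim: torch.Tensor,
                    best_separation: float) -> tuple[bool, float]:
        # Separation = mean positive-pair minus mean negative-pair similarity.
        separation = (pos_sim.mean() - neg_sim.mean()).item()
        # Separation alone decides the "best" checkpoint; loss is no longer a fallback.
        return separation > best_separation, separation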
Implement contrastive learning with LoRA to fine-tune CLIP's vision
encoder on the LogoDet-3K dataset for improved logo embedding
similarity.
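The underlying technique is standard PEFT-style LoRA injection into the
CLIP vision encoder. A self-contained sketch using the Hugging Face
transformers and peft APIs; the rank, alpha, and target modules below
are illustrative, not this repo's defaults:

    from peft import LoraConfig, get_peft_model
    from transformers import CLIPVisionModel

    vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

    # Inject low-rank adapters into the attention projections; only these
    # small matrices are trained, while the base CLIP weights stay frozen.
    lora_cfg = LoraConfig(
        r=16,                                 # adapter rank (assumed value)
        lora_alpha=32,                        # scaling factor (assumed value)
        target_modules=["q_proj", "v_proj"],  # CLIP attention projections
        lora_dropout=0.1,
    )
    vision = get_peft_model(vision, lora_cfg)
    vision.print_trainable_parameters()  # only adapter params require grad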
New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations (InfoNCE
  is sketched after this list)
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics
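For reference, a minimal InfoNCE over a batch of matched embedding
pairs; this is a sketch of the idea, not the exact losses.py code:

    import torch
    import torch.nn.functional as F

    def info_nce(z1: torch.Tensor, z2: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
        """z1, z2: (N, D) embeddings where (z1[i], z2[i]) is a positive pair."""
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature                   # (N, N) cosine similarities
        labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
        return F.cross_entropy(logits, labels)               # off-diagonals act as negatives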
New scripts:
- train_clip_logo.py: Main training entry point (example invocation
  below)
- export_model.py: Export to HuggingFace-compatible format
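An assumed invocation; the actual CLI flags of train_clip_logo.py may
differ:

    python train_clip_logo.py --config configs/jetson_orin.yaml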
Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs
Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates
Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies