- Test a range of thresholds to find the optimal F1
- Support both baseline and fine-tuned models
- Option for max vs. mean similarity aggregation
- Output a results table with TP/FP/FN/precision/recall/F1
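
The threshold sweep described above could look roughly like the following. This is a minimal sketch, not the actual implementation: the function names (`aggregate`, `sweep_thresholds`, `best_row`, `format_table`) and the input layout (one list of similarity scores per item, plus a boolean ground-truth label per item) are assumptions made for illustration. Because it takes precomputed similarities, the same code would apply to scores from either the baseline or the fine-tuned model.

```python
from statistics import mean


def aggregate(sims, mode="max"):
    """Collapse one item's similarity scores into a single value."""
    if mode == "max":
        return max(sims)
    if mode == "mean":
        return mean(sims)
    raise ValueError(f"unknown aggregation mode: {mode}")


def sweep_thresholds(similarities, labels, thresholds, mode="max"):
    """Compute TP/FP/FN/precision/recall/F1 at each threshold.

    similarities: one list of similarity scores per item (hypothetical layout)
    labels: one boolean per item, True if the item is a real positive
    """
    scores = [aggregate(s, mode) for s in similarities]
    rows = []
    for t in thresholds:
        # An item is predicted positive when its aggregated score reaches t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"threshold": t, "tp": tp, "fp": fp, "fn": fn,
                     "precision": precision, "recall": recall, "f1": f1})
    return rows


def best_row(rows):
    """Return the row with the highest F1 (first one wins on ties)."""
    return max(rows, key=lambda r: r["f1"])


def format_table(rows):
    """Render the sweep results as a fixed-width text table."""
    lines = [f"{'thresh':>6} {'TP':>4} {'FP':>4} {'FN':>4} "
             f"{'prec':>6} {'rec':>6} {'F1':>6}"]
    for r in rows:
        lines.append(f"{r['threshold']:>6.2f} {r['tp']:>4} {r['fp']:>4} "
                     f"{r['fn']:>4} {r['precision']:>6.3f} "
                     f"{r['recall']:>6.3f} {r['f1']:>6.3f}")
    return "\n".join(lines)
```

Running `sweep_thresholds` once per model and per aggregation mode, then printing `format_table` for each, would give directly comparable tables; `best_row` picks the threshold to report.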