Initial commit: Jersey detection test suite
Test scripts and utilities for evaluating vision-language models on jersey number detection using llama.cpp server.
docs/LLAMA_SWAP_SETUP.md · new file · 237 lines
# llama-swap Setup Guide for Jersey Detection Testing

This guide explains how to use [llama-swap](https://github.com/mostlygeek/llama-swap) to automatically switch between different vision language models when testing jersey detection.

## What is llama-swap?

llama-swap is a model-swapping proxy that sits between your application and llama.cpp servers. It automatically loads and unloads models based on the `model` parameter in API requests, allowing you to test multiple models without manually restarting servers.
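
Because swapping keys off the `model` field of ordinary API requests, any OpenAI-compatible client can trigger a load. A minimal `curl` sketch (assuming llama-swap is listening on `localhost:8080` as configured below; the text-only prompt is just a placeholder):

```bash
# The "model" field selects which entry from llama-swap-config.yaml gets loaded
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-vl-7b",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```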
## Installation

### Docker (Recommended)

```bash
# Pull the CUDA image (or cpu, vulkan, intel depending on your hardware)
docker pull ghcr.io/mostlygeek/llama-swap:cuda
```

### Homebrew (macOS/Linux)

```bash
brew tap mostlygeek/llama-swap
brew install llama-swap
```

### Pre-built Binaries

Download from the [releases page](https://github.com/mostlygeek/llama-swap/releases).

## Configuration

A configuration file `llama-swap-config.yaml` is provided with 8 pre-configured vision models:

### Small Models (1-4B parameters)
- `lfm2-vl-1.6b` - LiquidAI LFM2-VL 1.6B (F16)
- `gemma-3-4b` - Gemma 3 4B Instruct (F16)
- `kimi-vl-3b` - Kimi VL A3B Thinking (F16)

### Medium Models (7-12B parameters)
- `qwen2.5-vl-7b` - Qwen2.5-VL 7B Instruct (F16)
- `gemma-3-12b` - Gemma 3 12B Instruct (F16)

### Large Models (24-27B parameters)
- `mistral-small-24b-q8` - Mistral Small 3.2 24B (Q8_K_XL)
- `mistral-small-24b-q4` - Mistral Small 3.2 24B (Q4_K_XL)
- `gemma-3-27b` - Gemma 3 27B Instruct (Q8_0)

## Starting llama-swap

### Using Docker

```bash
docker run -it --rm --runtime nvidia -p 8080:8080 \
  -v $(pwd)/llama-swap-config.yaml:/app/config.yaml \
  -v /path/to/hf/cache:/root/.cache/huggingface \
  ghcr.io/mostlygeek/llama-swap:cuda
```

### Using Binary

```bash
llama-swap --config llama-swap-config.yaml --listen localhost:8080
```

## Testing with Jersey Detection Script

Once llama-swap is running, you can test different models by specifying the `--model-tag` parameter:

### Test a Single Model

```bash
# Test Qwen2.5-VL 7B with resizing
python test_jersey_detection.py ./images jersey_prompt.txt \
  --model-tag "qwen2.5-vl-7b" \
  --resize 1024
```

### Test Multiple Models Sequentially

```bash
# Test small models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "lfm2-vl-1.6b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "kimi-vl-3b" --resize 1024

# Test medium models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "qwen2.5-vl-7b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-12b" --resize 1024

# Test large models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "mistral-small-24b-q4" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-27b" --resize 1024
```
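
The same sweep can also be written as a loop over tags, using exactly the commands above:

```bash
# Sweep a list of model tags with identical settings
for tag in lfm2-vl-1.6b gemma-3-4b kimi-vl-3b qwen2.5-vl-7b gemma-3-12b \
           mistral-small-24b-q4 gemma-3-27b; do
  python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "$tag" --resize 1024
done
```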
### Automated Testing Scripts

Two bash scripts are provided for automated testing:

#### 1. Full Test Suite (`test_all_models.sh`)

Tests **all models** defined in `llama-swap-config.yaml`:

```bash
# Basic usage (uses defaults)
./test_all_models.sh ./test_images

# Customize configuration with environment variables
RESIZE=2048 ./test_all_models.sh ./test_images
OUTPUT_FILE=custom_results.jsonl ./test_all_models.sh ./test_images
PROMPT_FILE=custom_prompt.txt ./test_all_models.sh ./test_images

# Disable resizing
RESIZE= ./test_all_models.sh ./test_images
```

**Features:**
- Automatically extracts all model tags from the YAML config (see the sketch after this list)
- Color-coded output with progress tracking
- Prompts for confirmation before starting tests
- Shows a summary with success/failure counts
- Asks whether to continue if a model fails
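
For reference, the tag extraction can be approximated with standard tools. The following is an illustrative sketch, not necessarily how `test_all_models.sh` implements it; it assumes each model entry is a key indented two spaces under a top-level `models:` section:

```bash
# Illustrative only: list model tags from llama-swap-config.yaml,
# assuming entries are two-space-indented keys under `models:`
sed -n '/^models:/,/^[^ ]/p' llama-swap-config.yaml \
  | grep -E '^  [A-Za-z0-9._-]+:' \
  | sed -E 's/^  ([^:]+):.*/\1/'
```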
**Default Configuration:**
- Images: `./test_images`
- Prompt: `jersey_prompt_with_confidence.txt`
- Resize: `1024` (pixels)
- Output: `jersey_detection_results.jsonl`

#### 2. Quick Test (`test_quick.sh`)

Tests a **small subset** of models for rapid iteration:

```bash
# Test default selection (small, medium, large)
./test_quick.sh ./test_images

# Test custom models
MODELS="lfm2-vl-1.6b qwen2.5-vl-7b" ./test_quick.sh ./test_images

# Customize settings
RESIZE=512 MODELS="gemma-3-4b" ./test_quick.sh ./test_images
```

**Default Models:**
- `lfm2-vl-1.6b` (Small - 1.6B)
- `qwen2.5-vl-7b` (Medium - 7B)
- `mistral-small-24b-q4` (Large - 24B Q4)

**Use Cases:**
- Quick validation after prompt changes
- Testing configuration adjustments
- Rapid prototyping before a full test run

## Analyzing Results

After testing multiple models, use the analysis script to compare performance:

```bash
python analyze_jersey_results.py
```

This will show:
- A comparison table of all models tested
- Performance charts with hallucination rates
- Best performers by speed and accuracy
- Confidence distribution (if applicable)

## Model Swapping Behavior

llama-swap will:
1. **Automatically load** the requested model when you specify `--model-tag`
2. **Automatically unload** the previous model when a request names a different one
3. **Keep the current model loaded** when consecutive requests use the same tag

You can monitor loading and unloading in the web UI at `http://localhost:8080/ui`. The sketch after this list shows the swapping behavior in practice.
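
For example, using commands from earlier in this guide:

```bash
# Same tag twice: the second run reuses the already-loaded model
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024

# Different tag: gemma-3-4b is unloaded and qwen2.5-vl-7b is loaded automatically
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "qwen2.5-vl-7b" --resize 1024
```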
## Optional: Model Auto-Unloading

To automatically unload models after 5 minutes of inactivity, uncomment this line in `llama-swap-config.yaml`:

```yaml
ttl: 300
```

## Optional: Preload Model on Startup

To preload a specific model when llama-swap starts, uncomment and modify this section:

```yaml
hooks:
  onStartup:
    - loadModel: qwen2.5-vl-7b
```

## Customizing Models

To add or modify models, edit `llama-swap-config.yaml`:

```yaml
models:
  my-custom-model:
    name: "My Custom Model Description"
    cmd: llama-server --no-mmap -ngl 999 -fa on --host 0.0.0.0 --port ${PORT} -hf user/model-name:quantization
```

Then test with:

```bash
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "my-custom-model"
```

## Troubleshooting

### Model not loading
- Check llama-swap logs at `http://localhost:8080/log` or via `curl http://localhost:8080/log/stream`
- Verify the model name in the config matches the `--model-tag` parameter (a quick check is sketched below)
- Ensure sufficient GPU memory for the model
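
Since llama-swap serves an OpenAI-compatible API, listing the configured model IDs via the standard models endpoint should work (assuming the default `localhost:8080` address used above):

```bash
# List the model IDs llama-swap knows about; --model-tag must match one exactly
curl -s http://localhost:8080/v1/models
```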
### Connection refused
- Verify llama-swap is running: `curl http://localhost:8080/health`
- Check that the server URL matches; the default is `http://192.168.1.126:8080` (from `scan.ini`)

### Slow model switching
- The first load downloads the model from Hugging Face, which can be slow
- Subsequent loads are faster (cached locally)
- Use quantized models (Q4, Q8) for faster loading and lower memory usage

## Web UI

llama-swap includes a web interface for monitoring:
- **Dashboard**: `http://localhost:8080/ui` - View loaded models and logs
- **Activity**: See recent API requests
- **Logs**: Real-time log monitoring

## References

- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [llama-swap Documentation](https://github.com/mostlygeek/llama-swap/tree/main/docs)
- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)