
# llama-swap Setup Guide for Jersey Detection Testing

This guide explains how to use llama-swap to automatically switch between different vision-language models when testing jersey detection.

## What is llama-swap?

llama-swap is a model-swapping proxy that sits between your application and llama.cpp servers. It automatically loads and unloads models based on the `model` parameter in API requests, allowing you to test multiple models without manually restarting servers.
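
As a concrete illustration, a single OpenAI-style request is all it takes to trigger a swap. This is a minimal sketch assuming llama-swap is already listening on port 8080 (see "Starting llama-swap" below):

```bash
# Requesting a model by name makes llama-swap load it first if necessary.
# The "model" value must match a model key in llama-swap-config.yaml.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-vl-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```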

## Installation

### Docker

```bash
# Pull the CUDA image (or the cpu, vulkan, or intel variant, depending on your hardware)
docker pull ghcr.io/mostlygeek/llama-swap:cuda
```

### Homebrew (macOS/Linux)

```bash
brew tap mostlygeek/llama-swap
brew install llama-swap
```

### Pre-built Binaries

Download a binary for your platform from the project's GitHub releases page.

## Configuration

A configuration file, `llama-swap-config.yaml`, is provided with 8 pre-configured vision models:

### Small Models (1-4B parameters)

- `lfm2-vl-1.6b` - LiquidAI LFM2-VL 1.6B (F16)
- `gemma-3-4b` - Gemma 3 4B Instruct (F16)
- `kimi-vl-3b` - Kimi VL A3B Thinking (F16)

### Medium Models (7-12B parameters)

- `qwen2.5-vl-7b` - Qwen2.5-VL 7B Instruct (F16)
- `gemma-3-12b` - Gemma 3 12B Instruct (F16)

### Large Models (24-27B parameters)

- `mistral-small-24b-q8` - Mistral Small 3.2 24B (Q8_K_XL)
- `mistral-small-24b-q4` - Mistral Small 3.2 24B (Q4_K_XL)
- `gemma-3-27b` - Gemma 3 27B Instruct (Q8_0)
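
Once the proxy is running (next section), you can confirm which tags it picked up from the config; llama-swap serves the standard OpenAI model-listing endpoint:

```bash
# Lists every model key defined in llama-swap-config.yaml
curl http://localhost:8080/v1/models
```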

## Starting llama-swap

### Using Docker

```bash
docker run -it --rm --runtime nvidia -p 8080:8080 \
  -v $(pwd)/llama-swap-config.yaml:/app/config.yaml \
  -v /path/to/hf/cache:/root/.cache/huggingface \
  ghcr.io/mostlygeek/llama-swap:cuda
```

### Using Binary

```bash
llama-swap --config llama-swap-config.yaml --listen localhost:8080
```
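
Either way, a quick health check (the same endpoint used under Troubleshooting below) confirms the proxy is reachable before you start testing:

```bash
# Returns a successful response once llama-swap is listening on port 8080
curl http://localhost:8080/health
```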

## Testing with Jersey Detection Script

Once llama-swap is running, you can test different models by specifying the `--model-tag` parameter:

### Test a Single Model

```bash
# Test Qwen2.5-VL 7B with resizing
python test_jersey_detection.py ./images jersey_prompt.txt \
  --model-tag "qwen2.5-vl-7b" \
  --resize 1024
```

### Test Multiple Models Sequentially

```bash
# Test small models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "lfm2-vl-1.6b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "kimi-vl-3b" --resize 1024

# Test medium models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "qwen2.5-vl-7b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-12b" --resize 1024

# Test large models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "mistral-small-24b-q4" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-27b" --resize 1024
```
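
These one-off runs can also be wrapped in a simple loop. The sketch below just restates the commands above for all eight configured tags; the provided test_all_models.sh (next section) does the same thing more robustly:

```bash
# Run the same test against each configured model tag in turn
for tag in lfm2-vl-1.6b gemma-3-4b kimi-vl-3b \
           qwen2.5-vl-7b gemma-3-12b \
           mistral-small-24b-q8 mistral-small-24b-q4 gemma-3-27b; do
  python test_jersey_detection.py ./images jersey_prompt.txt \
    --model-tag "$tag" --resize 1024
done
```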

## Automated Testing Scripts

Two bash scripts are provided for automated testing:

### 1. Full Test Suite (`test_all_models.sh`)

Tests all models defined in `llama-swap-config.yaml`:

```bash
# Basic usage (uses defaults)
./test_all_models.sh ./test_images

# Customize configuration with environment variables
RESIZE=2048 ./test_all_models.sh ./test_images
OUTPUT_FILE=custom_results.jsonl ./test_all_models.sh ./test_images
PROMPT_FILE=custom_prompt.txt ./test_all_models.sh ./test_images

# Disable resize
RESIZE= ./test_all_models.sh ./test_images
```

Features:

- Automatically extracts all model tags from the YAML config (approximated in the sketch after this list)
- Color-coded output with progress tracking
- Confirms before starting tests
- Shows summary with success/failure counts
- Asks to continue if a model fails
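
The tag-extraction step can be approximated with standard shell tools. This is a minimal sketch, assuming model keys sit exactly two spaces under a top-level `models:` section (as in the provided config); the actual script may do this differently:

```bash
# Print the model tags from llama-swap-config.yaml (assumes two-space
# indentation for keys directly under "models:"; adjust for other layouts)
sed -n '/^models:/,$p' llama-swap-config.yaml |
  grep -E '^  [A-Za-z0-9._-]+:' |
  tr -d ' :'
```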

Default Configuration:

- Images: `./test_images`
- Prompt: `jersey_prompt_with_confidence.txt`
- Resize: 1024px
- Output: `jersey_detection_results.jsonl`

### 2. Quick Test (`test_quick.sh`)

Tests a small subset of models for rapid iteration:

```bash
# Test default selection (small, medium, large)
./test_quick.sh ./test_images

# Test custom models
MODELS="lfm2-vl-1.6b qwen2.5-vl-7b" ./test_quick.sh ./test_images

# Customize settings
RESIZE=512 MODELS="gemma-3-4b" ./test_quick.sh ./test_images
```

Default Models:

- `lfm2-vl-1.6b` (Small - 1.6B)
- `qwen2.5-vl-7b` (Medium - 7B)
- `mistral-small-24b-q4` (Large - 24B Q4)

Use Cases:

- Quick validation after prompt changes
- Testing configuration adjustments
- Rapid prototyping before a full test run

## Analyzing Results

After testing multiple models, use the analysis script to compare performance:

```bash
python analyze_jersey_results.py
```

This will show:

- Comparison table of all models tested
- Performance charts with hallucination rates
- Best performers by speed and accuracy
- Confidence distribution (if applicable)
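
For a quick look at the raw output without the analysis script, a jq one-liner works too. Note that the `model` field name below is an assumption about the JSONL schema, not something confirmed by this guide:

```bash
# Count result records per model in the JSONL output
# ("model" is an assumed field name; adjust to the actual schema
# produced by test_jersey_detection.py)
jq -s 'group_by(.model) | map({model: .[0].model, records: length})' \
  jersey_detection_results.jsonl
```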

## Model Swapping Behavior

llama-swap will:

1. Automatically load the requested model when you specify `--model-tag`
2. Automatically unload the previous model when a request names a different one
3. Keep the current model loaded if you test the same model multiple times
4. Let you monitor model loading and unloading in the web UI at http://localhost:8080/ui
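
If you prefer the command line to the web UI for step 4, llama-swap also reports its state over plain HTTP. The endpoint path below comes from llama-swap's upstream documentation, so treat it as an assumption if your version differs:

```bash
# Ask the proxy which model processes (if any) are currently running
curl http://localhost:8080/running
```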

## Optional: Model Auto-Unloading

To automatically unload models after 5 minutes of inactivity, uncomment this line in `llama-swap-config.yaml`:

```yaml
ttl: 300
```

## Optional: Preload Model on Startup

To preload a specific model when llama-swap starts, uncomment and modify this section:

```yaml
hooks:
  onStartup:
    - loadModel: qwen2.5-vl-7b
```

## Customizing Models

To add or modify models, edit `llama-swap-config.yaml`:

```yaml
models:
  my-custom-model:
    name: "My Custom Model Description"
    cmd: llama-server --no-mmap -ngl 999 -fa on --host 0.0.0.0 --port ${PORT} -hf user/model-name:quantization
```

Then test with:

```bash
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "my-custom-model"
```

## Troubleshooting

### Model not loading

- Check llama-swap logs at http://localhost:8080/log or via `curl http://localhost:8080/log/stream`
- Verify the model name in the config matches the `--model-tag` parameter
- Ensure sufficient GPU memory for the model

### Connection refused

- Verify llama-swap is running: `curl http://localhost:8080/health`
- Check that the server URL in `scan.ini` matches where llama-swap is listening; the default there is http://192.168.1.126:8080

### Slow model switching

- The first load downloads the model from Hugging Face, which can be slow
- Subsequent loads are faster (the model is cached locally)
- Use quantized models (Q4, Q8) for faster loading and lower memory usage

## Web UI

llama-swap includes a web interface for monitoring:

- Dashboard: http://localhost:8080/ui - View loaded models and logs
- Activity: See recent API requests
- Logs: Real-time log monitoring
