This document contains performance metrics and search results for the three embedding models used in the image retrieval system.
- Qwen/Qwen3-Embedding-0.6B - Qwen embedding model (600M parameters)
- Alibaba-NLP/gte-multilingual-base - GTE-multilingual embedding model
- google/embeddinggemma-300m - Google EmbeddingGemma model (300M parameters)
- Total Images: 17 images
- Device: CPU (CUDA not available)
- Date: Test run from terminal output
| Model | Total Time | Per Image Time | Speed Ranking |
|---|---|---|---|
| GTE-multilingual | 1.72 seconds | 0.1009 seconds | 🥇 Fastest |
| EmbeddingGemma | 2.38 seconds | 0.1400 seconds | 🥈 Second |
| Qwen | 7.37 seconds | 0.4335 seconds | 🥉 Slowest |
- GTE-multilingual is the fastest, taking only 1.72 seconds for 17 images (0.10s per image)
- EmbeddingGemma is moderately fast, taking 2.38 seconds (0.14s per image)
- Qwen is the slowest, taking 7.37 seconds (0.43s per image) - approximately 4.3x slower than GTE
| Model | Search Time | Speed Ranking |
|---|---|---|
| GTE-multilingual | 0.3971 seconds | 🥇 Fastest |
| EmbeddingGemma | 0.9877 seconds | 🥈 Second |
| Qwen | 2.9163 seconds | 🥉 Slowest |
- GTE-multilingual is fastest for search (0.40s)
- EmbeddingGemma is moderate (0.99s)
- Qwen is slowest (2.92s) - approximately 7.3x slower than GTE
| Rank | Similarity | Image | Caption |
|---|---|---|---|
| 1 | 0.3831 | Screenshot 2025-12-04 132720.png | a man is riding a motorcycle on the street |
| 2 | 0.2454 | wallpaper.png | a tori tori floating in the water |
| 3 | 0.2336 | 1314707.jpg | a girl with long hair and a bow on her head |
Analysis:
- ✅ Correctly identified the motorcycle image as top result
⚠️ Lower similarity scores overall (max 0.38)⚠️ Second result (tori floating in water) seems unrelated
| Rank | Similarity | Image | Caption |
|---|---|---|---|
| 1 | 0.5844 | Screenshot 2025-12-04 132720.png | a man is riding a motorcycle on the street |
| 2 | 0.5578 | wallpaper.png | a tori tori floating in the water |
| 3 | 0.5199 | Screenshot 2025-12-03 122250.png | a tray with two cups of food and a drink |
Analysis:
- ✅ Correctly identified the motorcycle image as top result
- ✅ Higher similarity scores (max 0.58) - better confidence
⚠️ Second result still seems unrelated (tori floating in water)- ✅ Third result different from Qwen (food tray vs girl with bow)
| Rank | Similarity | Image | Caption |
|---|---|---|---|
| 1 | 0.3735 | Screenshot 2025-12-04 132720.png | a man is riding a motorcycle on the street |
| 2 | 0.2694 | ultra-instinct-goku-dragon-ball-super-5k-5760x3240-5127.jpg | dragon ball wallpaper by the - dragon - ballpaper |
| 3 | 0.2616 | Screenshot 2025-12-04 132823.png | a woman standing next to a tree |
Analysis:
- ✅ Correctly identified the motorcycle image as top result
⚠️ Similarity scores similar to Qwen (max 0.37)⚠️ Different second and third results compared to other models⚠️ Results seem less relevant (Dragon Ball wallpaper, woman by tree)
- GTE-multilingual: Fastest for both indexing and search
- EmbeddingGemma: Moderate speed
- Qwen: Slowest but still functional
-
GTE-multilingual:
- Highest similarity scores (0.58 for correct result)
- Best confidence in results
- Fastest search time
-
Qwen:
- Moderate similarity scores (0.38 for correct result)
- Correctly identified top result
- Slower but accurate
-
EmbeddingGemma:
- Similar similarity scores to Qwen (0.37 for correct result)
- Correctly identified top result
- Different ranking for lower results
For Speed-Critical Applications:
- Use GTE-multilingual - fastest overall performance
For Accuracy-Critical Applications:
- Use GTE-multilingual - highest similarity scores and best confidence
- Consider Qwen as alternative - good accuracy but slower
For Multilingual Support:
- GTE-multilingual - explicitly designed for multilingual tasks
- EmbeddingGemma - may have multilingual capabilities
For Balanced Performance:
- GTE-multilingual - best balance of speed and accuracy
- All models correctly identified the motorcycle image as the most relevant result for the "bike" query
- Similarity scores vary significantly between models (GTE shows higher confidence)
- Search results show some variation in ranking, especially for lower-ranked items
- Performance was tested on CPU; GPU acceleration would improve all models' speed
- Results may vary based on query type, image content, and caption quality
- Test with more diverse queries (objects, actions, scenes, abstract concepts)
- Test with larger image datasets (100+, 1000+ images)
- Compare results on GPU vs CPU
- Evaluate multilingual query performance
- Test with different image types (photos, illustrations, screenshots, etc.)
- Measure precision@k and recall@k metrics
- User preference studies for result quality
Last Updated: Based on test run from terminal output Test Environment: Windows, CPU-only, 17 images