Usage instructions can be found in the Usage section towards the end of this report.
Test images are located at `/images/test_xx.jpg`.
A crawler is implemented in crawl.py.
The crawler uses the selenium package and is designed specifically to crawl the product images on an amazon.ca search results page, which looks similar to this:
The crawler targets every product image and downloads the best-quality version of each book cover.
amazon.ca displays around 40-60 products per page, and the crawler crawls one page at a time. By giving the crawler different page URLs as input, I used it to download ~200 high-quality book covers, each with a resolution of about 700x1000.
- The default image source is compressed. By investigating the webpage's HTML structure, I was able to locate the images at their original quality.
- The crawler was occasionally blocked by amazon.ca's protection mechanisms, including reCAPTCHA. Solved by giving the browser a 10 s window to allow manual input.
- SSL certificate check failures. Solved with the Python certifi package.
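For reference, here is a minimal sketch of the crawling approach described above. It is not the actual crawl.py; the CSS selector and the trick for recovering the original-quality image URL are assumptions about Amazon's markup.

```python
# Sketch only: the CSS selector and the high-resolution URL pattern are assumptions,
# not the exact logic in crawl.py.
import ssl
import time
import urllib.request

import certifi
from selenium import webdriver
from selenium.webdriver.common.by import By

ssl_ctx = ssl.create_default_context(cafile=certifi.where())  # avoids SSL certificate errors


def crawl_page(url, out_dir="book_covers"):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(10)  # 10 s window to solve reCAPTCHA manually if it appears

    # Hypothetical selector for product cover images on the search results page
    for i, img in enumerate(driver.find_elements(By.CSS_SELECTOR, "img.s-image")):
        src = img.get_attribute("src")            # compressed thumbnail URL
        hi_res = src.split("._")[0] + ".jpg"      # assumed pattern for the original-quality image
        with urllib.request.urlopen(hi_res, context=ssl_ctx) as resp:
            with open(f"{out_dir}/cover_{i}.jpg", "wb") as f:
                f.write(resp.read())

    driver.quit()
```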
homography_RANSAC(), which finds the best homography between a book cover and a test image, is implemented in helper.py. The majority of the work is adapted from my A4 submission.
The function can be described with the following pseudocode:
```
def find_all_matches(SIFT_img1, SIFT_img2):
    """
    Helper function for homography_RANSAC(): find matches between two sets of SIFT descriptors.
    """
    for each descriptor in SIFT_img1:
        compute distance to all descriptors in SIFT_img2
        if closest distance / second-closest distance < 0.8:   # ratio test
            accept this match (the closest one)
        else:
            discard this match
    return [all matches found between SIFT_img1 and SIFT_img2]
```
```
def homography_RANSAC():
    SIFT_img1, SIFT_img2 <- SIFT 128D descriptors of both images

    # a small optimization to improve runtime from ~1 min to ~10 s
    while SIFT_img1 or SIFT_img2 has more than 10,000 SIFT descriptors:
        increase the SIFT detection contrast threshold
        recompute and reassign SIFT_img1 or SIFT_img2

    # start the process
    matches <- find_all_matches(SIFT_img1, SIFT_img2)   # see function above
    for 200 iterations:
        randomly pick 4 matches and compute the homography
        for each remaining match:
            apply the homography to the img1 keypoint and compute the distance
            between the projected point and the real img2 keypoint
            if distance < some threshold:
                inlier count += 1
            else:
                not an inlier
    after 200 iterations, the homography from the 4 matches with the highest inlier count wins
    return the winner
```

Since both the book covers and the test images are very high-resolution, one image can have more than 20,000 SIFT keypoints. The time complexity of find_all_matches() is O(n x m), where n and m are the numbers of descriptors in the two images, so matching quickly becomes the bottleneck.
The solution is the small optimization shown in the pseudocode above: when too many SIFT keypoints are detected, most of them are noisy and meaningless, so the contrast threshold is increased and the keypoints are recomputed. I tried keypoint-count thresholds of 2,000, 5,000, 8,000, and 10,000, and noticed that when fewer than 10,000 keypoints are kept, the homography computation has noticeably worse quality and does not correctly find the book cover in the test image.
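For reference, here is a minimal runnable sketch of this pipeline using OpenCV: adaptive contrast threshold, ratio-test matching, and a 200-iteration RANSAC loop. The growth factor for the contrast threshold and the inlier distance threshold are assumptions, not the exact values used in helper.py.

```python
import cv2
import numpy as np


def detect_capped_sift(gray, max_kp=10000):
    """Raise the SIFT contrast threshold until at most max_kp keypoints are detected."""
    thresh = 0.04  # OpenCV's default contrastThreshold
    while True:
        sift = cv2.SIFT_create(contrastThreshold=thresh)
        kp, des = sift.detectAndCompute(gray, None)
        if des is None or len(kp) <= max_kp:
            return kp, des
        thresh *= 1.5  # assumed growth factor; the report only says "increase"


def ratio_test_matches(des1, des2, ratio=0.8):
    """Lowe's ratio test over brute-force L2 matches (same idea as find_all_matches)."""
    raw = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    return [pair[0] for pair in raw
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]


def homography_ransac_sketch(gray1, gray2, iters=200, inlier_thresh=5.0):
    """Return the homography (img1 -> img2) with the most inliers after `iters` random samples.

    gray1, gray2: 8-bit grayscale images; assumes at least 4 good matches are found.
    """
    kp1, des1 = detect_capped_sift(gray1)
    kp2, des2 = detect_capped_sift(gray2)
    matches = ratio_test_matches(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    best_H, best_inliers = None, -1
    for _ in range(iters):
        idx = np.random.choice(len(matches), 4, replace=False)
        try:
            H = cv2.getPerspectiveTransform(pts1[idx], pts2[idx])  # exact H from 4 correspondences
        except cv2.error:
            continue  # degenerate (e.g. collinear) sample
        proj = cv2.perspectiveTransform(pts1.reshape(-1, 1, 2), H).reshape(-1, 2)
        inliers = int(np.sum(np.linalg.norm(proj - pts2, axis=1) < inlier_thresh))
        if inliers > best_inliers:
            best_H, best_inliers = H, inliers
    return best_H
```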
Below is an example using 2,000 as the threshold:
10,000 keypoints is a good compromise that preserves high-quality homography computation while still giving an almost 10x runtime improvement (from ~90 s to under 10 s). Below is the example using 10,000 as the threshold:
In their paper, Nister and Stewenius propose a Vocabulary Tree approach, which is essentially a hierarchical k-means tree built over the database's SIFT 128D descriptors. It allows very efficient descriptor comparison: the closest SIFT descriptors can be found with only a few comparisons per tree level instead of a linear scan over the whole database.
The Python implementation can be found in find_book_cover.ipynb: build_vocabulary_tree(), find_closest_leaf(), and the database-image scoring.
Pseudocode of build_vocabulary_tree() and find_closest_leaf():
```
def build_vocabulary_tree(depth):
    all data are normalized to reduce the effect of the SIFT magnitudes
    if the bottom of the tree is reached:  # base case
        leaves are SIFT descriptors and their corresponding image
    else:
        define a k-means model with 10 clusters and fit the data set
        children are sub-vocabulary-trees: each center is a k-means center, and the data
        is assigned to its cluster based on euclidean distance
        recurse with build_vocabulary_tree(depth - 1) on each k-means cluster

def find_closest_leaf():
    recursively compare euclidean distances to the cluster centers at each level
    if a leaf is reached (a real SIFT descriptor with its corresponding image), return the image  # base case
```

For each query image, here is how the likelihood score against all the database images is computed:
```
scores <- tracker of scores for each database image
for each descriptor des in the test image:
    normalize the descriptor
    winner <- find_closest_leaf(vocabulary_tree, des)
    scores[winner] += 1
image ranking is based on the highest score in scores
```
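Below is a hedged, runnable sketch of this vocabulary-tree construction, leaf lookup, and scoring, using scikit-learn's KMeans. The class layout, parameter names, and the assumption that image IDs are integer indices are mine, not the exact code in find_book_cover.ipynb.

```python
import numpy as np
from sklearn.cluster import KMeans


class VocabNode:
    def __init__(self):
        self.center = None    # k-means center that leads to this subtree
        self.children = []    # sub-vocabulary-trees
        self.leaves = []      # (descriptor, image_id) pairs at the bottom level


def build_vocabulary_tree(descriptors, image_ids, depth, branching=10):
    """descriptors: (N, 128) normalized SIFT descriptors; image_ids: length-N integer labels."""
    node = VocabNode()
    if depth == 0 or len(descriptors) <= branching:   # base case: store the real descriptors
        node.leaves = list(zip(descriptors, image_ids))
        return node
    km = KMeans(n_clusters=branching, n_init=3).fit(descriptors)
    for c in range(branching):
        mask = km.labels_ == c
        child = build_vocabulary_tree(descriptors[mask], image_ids[mask], depth - 1, branching)
        child.center = km.cluster_centers_[c]
        node.children.append(child)
    return node


def find_closest_leaf(node, des):
    """Descend by nearest center; at a leaf node, return the image of the closest descriptor."""
    if not node.children:
        dists = [np.linalg.norm(des - d) for d, _ in node.leaves]
        return node.leaves[int(np.argmin(dists))][1]
    dists = [np.linalg.norm(des - child.center) for child in node.children]
    return find_closest_leaf(node.children[int(np.argmin(dists))], des)


def score_database(vocabulary_tree, query_descriptors, n_images):
    """Each query descriptor votes for the image owning its closest leaf."""
    scores = np.zeros(n_images)
    for des in query_descriptors:
        des = des / (np.linalg.norm(des) + 1e-8)      # normalize, matching the tree's data
        scores[find_closest_leaf(vocabulary_tree, des)] += 1
    return scores
```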
**Run time** of this project is important, which I will discuss in the Key Findings section below.
For each query image, the top-10 book cover candidates are found and displayed:
3 out of 5 books are successfully matched (at top 1, top 2, and top 5). I will discuss the 2 that did not match later, in the "problem" area.
From the example below, we can see that Efficient Retrieval is able to localize a book that is half covered, as long as the uncovered part has features interesting enough for the algorithm to match (more on localizing books in 5. and 6.).
Human beings can get bored if they do not click the mouse for five seconds. To keep people from getting bored, I added a progress bar with a time estimate, using the tqdm library, to every code block that does not finish instantaneously.
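This is standard tqdm usage; a small illustrative sketch (the loop body is a stand-in, not the notebook's actual per-image work):

```python
import time
from tqdm import tqdm

# Wrapping any slow loop in tqdm() shows a progress bar with a time estimate.
for _ in tqdm(range(203), desc="Loading database images"):
    time.sleep(0.05)   # stand-in for the real per-image loading work
```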
If you play with my code, you will find the following non-instantaneous code blocks:
- Load the whole image database. I have 203 images in my database, which takes ~15 s to load.
- Build the vocabulary tree. You can select how deep the tree goes. The deeper the tree, the more time it takes to build, but a deeper tree makes matching noticeably faster. The paper suggests 6 levels with a branching factor (BF) of 10, which takes considerably longer to build.
Here is the relationship between the number of layers and run time. It reflects "the deeper the tree, the more time to build, but the faster the matching".
| Vocabulary Tree Layers | BF | Build Tree (s) | Matching (s) |
|---|---|---|---|
| 2 | 10 | 5 | 33 |
| 3 | 10 | 9.1 | 3 |
| 4 | 10 | 16 | less than 1 |
| 5 | 10 | can't build | N/A |
The number of layers does not affect the matching result. If the layer count is too large, however, the tree cannot be built at all (see the 5-layer row in the table above).
Some books are consistently hard to find, like these two:
Possible reasons: these books lack distinctive patterns (large chunks of white area), and the text font is too thin, so a keypoint detector like SIFT cannot capture the pattern in the text, or it produces ambiguous keypoints that lead to a lower similarity score at the end of retrieval.
Here is the SIFT detection on the SSENSE magazine:
The model in the center of the magazine cover and the car model at the bottom have some SIFT keypoints. However, most of the keypoints are detected on the text, which can easily cause ambiguity in matching.
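For reference, a keypoint visualization like the one above can be produced with OpenCV; this is a sketch rather than the exact plotting code, and the file path is hypothetical:

```python
import cv2

img = cv2.imread("book_covers/ssense_magazine.jpg")   # hypothetical path to the magazine cover
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
keypoints = cv2.SIFT_create().detect(gray, None)

# Rich keypoints draw both the keypoint size and orientation.
vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("ssense_sift.jpg", vis)
```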
After Efficient Retrieval has found the top-10 candidate book covers, we can manually select which of the top 10 matches we want to localize. In the last code cell, `candidate = 0` selects which candidate to use (0-indexed).
This is done by applying homography_RANSAC() from 3. to match the chosen candidate to the test image.
The algorithm currently compares all SIFT descriptors in the test image against the book cover. However, Efficient Retrieval has already filtered out the group of SIFT descriptors that point to the book cover; we could use those instead of all of the image's descriptors to improve both efficiency and localization accuracy.
To sum up and visualize what I did: taking the homography matrix computed in 5. and applying it to the corner points given by the book cover's dimensions, I am able to localize the book in the test image.
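A small sketch of that last step, assuming H maps book-cover coordinates to test-image coordinates (the drawing color and thickness are arbitrary):

```python
import cv2
import numpy as np


def draw_localized_cover(test_img, cover_shape, H):
    """Project the cover's four corners through H and draw the outline on the test image."""
    h, w = cover_shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    projected = cv2.perspectiveTransform(corners, H)
    return cv2.polylines(test_img.copy(), [np.int32(projected)], isClosed=True,
                         color=(0, 255, 0), thickness=4)
```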
Here is the gallery of matching results:
- Open `find_book_cover.ipynb`.
- Change `DATABASE = "book_covers"` to the folder path of your image database.
- Run the first code block, which contains the imports and all the functions.
- Run the following two code blocks, which load the database and put together all SIFT descriptors.
- Run the next code block to build the vocabulary tree: `vocabulary_tree = build_vocabulary_tree(all_SIFT, 4)`. Adjust `4` to the desired layer number; run time is discussed in the section above. Using a notebook means the vocabulary tree does not need to be rebuilt for every match.
- Change `TARGET = "images/test_5.jpg"` to the OS path of the test image, then run the block to match the top 10 candidates from the database.
- After that, manually select a candidate from the top-10 matches via `candidate = 0` (0-indexed) and run the block to localize it.
To summarize, this tool can robustly detect and localize a book cover in a test image, even when the book is half covered. Building a deeper vocabulary tree requires more time but makes matching far more efficient, so in practice it is worth building a deeper tree in the offline phase so that the user can match book covers efficiently in the online phase.