MemeStreams | Clustering Billions of Images with Large Scale Nearest Neighbor Search

The proliferation of the web and digital photography have made large scale image collections containing billions of images a reality. Image collections on this scale make performing even the most common and simple computer vision, image processing, and machine learning tasks non-trivial. An example is nearest neighbor search, which not only serves as a fundamental subproblem in many more sophisticated algorithms, but also has direct applications, such as image retrieval and image clustering. In this paper, we address the nearest neighbor problem as the first step towards scalable image processing. We describe a scalable version of an approximate nearest neighbor search algorithm and discuss how it can be used to find near duplicates among over a billion images.

Found via browsing after pointer to Google's Analysis of Disk Failures.

You probably won't pay $19 to read this paper. However, the lead author's thesis covers the same territory:

Spill-tree is designed for approximate knn search. By adapting metric-trees to a more flexible data structure, spill-tree is able to adapt to the distribution of data and it scales well even for huge high-dimensional data sets. Significant efficiency improvement has been observed comparing to LSH (localify sensitive hashing), the state of art approximate knn algorithm. We applied spill-tree to three real-world applications: shot video segmentation, drug activity detection and image clustering, which I will explain in the thesis.

Her lab page also offers additional resources, including a survey of approximate nearest-neighbor algorithms and a more recent study on autonomous visualization. That's more for the life sciences, but still quite interesting.

Note that the now-at-Google Ting Liu is not to be confused with the Ting Liu at Princeton, who interned with Kevin Fall at Intel Research in Berkeley. She works on DTNs in sensor networks.

One of the other co-authors, Henry Rowley, has recently directly addressed the question of the day; well, that's overstating it, but this is likely (part of) the technology behind SafeSearch. (He's also on his way to breaking the hot-or-not captcha.)

Clustering Billions of Images with Large Scale Nearest Neighbor Search