Choosing Index#
Give a great amount of indexes and quantizers, how to choose the one in the experiment/application? In this part, we will give a general suggestion on how to choose the one fits your need.
0. Preparation#
Packages#
For CPU usage, run:
# %pip install -U faiss-cpu numpy h5py
For GPU on Linux x86_64 system, use Conda:
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
from urllib.request import urlretrieve
import h5py
import faiss
import numpy as np
Dataset#
In this tutorial, we’ll use SIFT1M, a very popular dataset for ANN evaluation, as our dataset to demonstrate the comparison.
Run the following cell to download the dataset or you can also manually download from the repo ann-benchmarks)
data_url = "http://ann-benchmarks.com/sift-128-euclidean.hdf5"
destination = "data.hdf5"
urlretrieve(data_url, destination)
Then load the data from the hdf5 file.
with h5py.File('data.hdf5', 'r') as f:
corpus = f['train'][:]
query = f['test'][:]
print(corpus.shape, corpus.dtype)
print(query.shape, corpus.dtype)
(1000000, 128) float32
(10000, 128) float32
d = corpus[0].shape[0]
k = 100
Helper function#
The following is a helper function for computing recall.
# compute recall from the prediction results and ground truth
def compute_recall(res, truth):
recall = 0
for i in range(len(res)):
intersect = np.intersect1d(res[i], truth[i])
recall += len(intersect) / len(res[i])
recall /= len(res)
return recall
1. Flat Index#
Flat index use brute force to search neighbors for each query. It guarantees the optimal result with 100% recall. Thus we use the result from it as the ground truth.
%%time
index = faiss.IndexFlatL2(d)
index.add(corpus)
CPU times: user 69.2 ms, sys: 80.6 ms, total: 150 ms
Wall time: 149 ms
%%time
D, I_truth = index.search(query, k)
CPU times: user 17min 30s, sys: 1.62 s, total: 17min 31s
Wall time: 2min 1s
2. IVF Index#
%%time
nlist = 5
nprob = 3
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.nprobe = nprob
index.train(corpus)
index.add(corpus)
CPU times: user 10.6 s, sys: 831 ms, total: 11.4 s
Wall time: 419 ms
%%time
D, I = index.search(query, k)
CPU times: user 9min 15s, sys: 598 ms, total: 9min 16s
Wall time: 12.5 s
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
Recall: 0.9999189999999997
From the test we can see that IVFFlatL2 has a pretty good promotion for the searching speed with a very tiny loss of recall.
3. HNSW Index#
%%time
M = 64
ef_search = 32
ef_construction = 64
index = faiss.IndexHNSWFlat(d, M)
# set the two parameters before adding data
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = ef_search
index.add(corpus)
CPU times: user 11min 21s, sys: 595 ms, total: 11min 22s
Wall time: 17 s
%%time
D, I = index.search(query, k)
CPU times: user 5.14 s, sys: 3.94 ms, total: 5.14 s
Wall time: 110 ms
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
Recall: 0.8963409999999716
From the searching time of less than 1 second, we can see why HNSW is one of the best choice when looking for an extreme speed during searching phase. The reduction of recall is acceptable. But the longer time during creation of index and large memory footprint need to be considered.
4. LSH#
%%time
nbits = d * 8
index = faiss.IndexLSH(d, nbits)
index.train(corpus)
index.add(corpus)
CPU times: user 13.7 s, sys: 660 ms, total: 14.4 s
Wall time: 12.1 s
%%time
D, I = index.search(query, k)
CPU times: user 3min 20s, sys: 84.2 ms, total: 3min 20s
Wall time: 5.64 s
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
Recall: 0.5856720000000037
As we covered in the last notebook, LSH is not a good choice when the data dimension is large. Here 128 is already burdened for LSH. As we can see, even we choose a relatively small nbits
of d * 8, the index creating time and search time are still pretty long. And the recall of about 58.6% is not satisfactory.
5. Scalar Quantizer Index#
%%time
qtype = faiss.ScalarQuantizer.QT_8bit
metric = faiss.METRIC_L2
index = faiss.IndexScalarQuantizer(d, qtype, metric)
index.train(corpus)
index.add(corpus)
CPU times: user 550 ms, sys: 18 ms, total: 568 ms
Wall time: 87.4 ms
%%time
D, I = index.search(query, k)
CPU times: user 7min 36s, sys: 169 ms, total: 7min 36s
Wall time: 12.7 s
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
Recall: 0.990444999999872
Here scalar quantizer index’s performance looks very similar to the Flat index. Because the elements of vectors in the SIFT dataset are integers in the range of [0, 218]. Thus the index does not lose to much information during scalar quantization. For the dataset with more complex distribution in float32. The difference will be more obvious.
6. Product Quantizer Index#
%%time
M = 16
nbits = 8
metric = faiss.METRIC_L2
index = faiss.IndexPQ(d, M, nbits, metric)
index.train(corpus)
index.add(corpus)
CPU times: user 46.7 s, sys: 22.3 ms, total: 46.7 s
Wall time: 1.36 s
%%time
D, I = index.search(query, k)
CPU times: user 1min 37s, sys: 106 ms, total: 1min 37s
Wall time: 2.8 s
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
Recall: 0.630898999999999
Product quantizer index is not standout in any one of the aspect. But it somewhat balance the tradeoffs. It is widely used in real applications with the combination of other indexes such as IVF or HNSW.