Faiss Indexes#
This tutorial walks through several widely used indexes in Faiss that fit different requirements, and shows how to use them.
Preparation#
For CPU usage, use:
%pip install faiss-cpu
For GPU usage on a Linux x86_64 system, use Conda:
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
import faiss
import numpy as np
np.random.seed(768)
data = np.random.random((1000, 128))
1. IndexFlat*#
The flat index is the most fundamental index structure. It does no preprocessing of the incoming vectors: all vectors are stored directly, without compression or quantization, so no training is needed for flat indexes.
When searching, a flat index scans all the stored vectors sequentially and computes the similarity score against the query vector. Thus, a flat index guarantees the globally optimal results.
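To make that brute-force behavior concrete, here is a minimal NumPy sketch of what such a search does conceptually (the variable names below are ours, not part of Faiss): compute the squared Euclidean distance, which is what IndexFlatL2 reports, from the query to every stored vector and sort.
query = data[:1]
# squared L2 distance from the query to every stored vector
sq_dists = ((data - query) ** 2).sum(axis=1)
# indices of the 3 closest vectors, smallest distance first
print(np.argsort(sq_dists)[:3])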
The flat index family is small: just IndexFlatL2 and IndexFlatIP, which differ only in their similarity metric, Euclidean distance versus inner product.
Usage:
d = 128 # dimension of the vector
k = 3 # number of nearest neighbors to search
# just simply create the index and add all the data
index = faiss.IndexFlatL2(d)
index.add(data)
Sanity check:
# search for the k nearest neighbor for the first element in data
D, I = index.search(data[:1], k)
print(f"closest elements: {I}")
print(f"distance: {D}")
closest elements: [[ 0 471 188]]
distance: [[ 0. 16.257435 16.658928]]
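IndexFlatIP works the same way but ranks by inner product. A common use, sketched below under the assumption that you want cosine similarity, is to L2-normalize the vectors first so that inner product equals cosine similarity (index_ip is just our variable name).
# L2-normalize so that inner product equals cosine similarity
normalized = (data / np.linalg.norm(data, axis=1, keepdims=True)).astype('float32')
index_ip = faiss.IndexFlatIP(d)
index_ip.add(normalized)
D_ip, I_ip = index_ip.search(normalized[:1], k)
print(f"closest elements: {I_ip}")
print(f"similarity: {D_ip}")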
Flat indexes guarantee perfect quality, but at very slow speed. They work well on small datasets or in cases where speed is not a crucial factor.
But what about the cases where speed matters? There is no way to have it all, so we want indexes that sacrifice as little quality as possible in order to speed up search. That is why approximate nearest-neighbor (ANN) algorithms are widely used. Below we go through a few popular ANN methods used in vector search.
2. IndexIVF*#
Intro#
The Inverted File Flat (IVF) index is a widely used technique that speeds up search by using k-means (a Voronoi partition) to split the whole space into a number of cells (clusters). Given a query, only the closest cells are visited, and the k closest elements to the query are then searched for within those cells.
quantizer is another index/quantizer used to assign vectors to the inverted lists.
nlist is the number of cells the space is partitioned into.
nprobe is the number of closest cells to visit when searching at query time.
Tradeoff#
Increasing nlist shrinks the size of each cell, which speeds up the search process. But the smaller coverage sacrifices accuracy and increases the likelihood of the edge/surface problem described in the Shortage section below.
Increasing nprobe widens the search scope, favoring search quality at the cost of slower speed.
Shortage#
A problem can arise when the query vector lands near the edge/surface of a cell: the closest element may fall into a neighboring cell, which may not be visited if nprobe is not large enough.
Example#
nlist = 5
nprobe = 8
# the quantizer defines how to store and compare the vectors
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
# note that unlike the flat index, the IVF index needs training first to create the cells
index.train(data)
index.add(data)
# set nprobe before searching
index.nprobe = nprobe
D, I = index.search(data[:1], k)
print(f"closest elements: {I}")
print(f"distance: {D}")
closest elements: [[ 0 471 188]]
distance: [[ 0. 16.257435 16.658928]]
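To see the nlist/nprobe tradeoff on this toy data, here is a small sketch of our own (the recall computation below is not a Faiss utility): it compares the IVF results for several nprobe values against exact results from a flat index.
# build an exact (flat) index to serve as ground truth
ground_truth = faiss.IndexFlatL2(d)
ground_truth.add(data)
_, I_exact = ground_truth.search(data[:100], k)
# sweep nprobe on the IVF index built above and measure recall@k
for nprobe in (1, 2, 5):
    index.nprobe = nprobe
    _, I_ivf = index.search(data[:100], k)
    recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(I_ivf, I_exact)])
    print(f"nprobe={nprobe}, recall@{k}: {recall:.2f}")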
3. IndexHNSW*#
Intro#
Hierarchical Navigable Small World (HNSW) indexing is a graph-based method that extends navigable small world (NSW). It builds a multi-layered graph in which nodes (vectors) are connected based on their proximity, forming “small-world” structures that allow efficient navigation through the space.
M is the number of neighbors each vector has in the graph.
efConstruction is the number of entry points to explore when building the index.
efSearch is the number of entry points to explore when searching.
Tradeoff#
Increasing M or efSearch yields higher fidelity at the cost of reasonably longer search time. A larger efConstruction mainly increases the index construction time.
HNSW offers great search quality and speed, but it is memory-consuming due to the graph structure: scaling up M causes a linear increase in memory usage.
Note that the HNSW index does not support vector removal, because removing nodes would destroy the graph structure.
Thus HNSW is a great index to choose when RAM is not a limiting factor.
Example#
M = 32
ef_search = 16
ef_construction = 32
index = faiss.IndexHNSWFlat(d, M)
# set the two parameters before adding data
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = ef_search
index.add(data)
D, I = index.search(data[:1], k)
print(f"closest elements: {I}")
print(f"distance: {D}")
closest elements: [[ 0 471 188]]
distance: [[ 0. 16.257435 16.658928]]
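As a quick illustration of the efSearch tradeoff, the following sketch (our own helper code, not a Faiss API) reuses the HNSW index built above and measures recall against an exact flat index for a few efSearch values.
# exact results from a flat index as ground truth
flat = faiss.IndexFlatL2(d)
flat.add(data)
_, I_exact = flat.search(data[:100], k)
# sweep efSearch on the HNSW index built above and measure recall@k
for ef in (4, 16, 64):
    index.hnsw.efSearch = ef
    _, I_hnsw = index.search(data[:100], k)
    recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(I_hnsw, I_exact)])
    print(f"efSearch={ef}, recall@{k}: {recall:.2f}")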
4. IndexLSH#
Intro#
Locality Sensitive Hashing (LSH) is an ANN method that hashes data points into buckets. While well-known uses of hash functions, such as dictionaries/hash tables, try to avoid hashing collisions, LSH tries to maximize them: similar vectors are grouped into the same hash bucket.
In Faiss, IndexLSH is a flat index with binary codes: vectors are hashed into binary codes and compared by Hamming distance.
nbits can be seen as the “resolution” of the hashed vectors.
Tradeoff#
Increasing nbits gives higher fidelity at the cost of more memory and longer search time.
LSH also suffers from the curse of dimensionality: as d grows, nbits has to be scaled up accordingly just to maintain the same search quality.
Shortage#
LSH speeds up search with a reasonable sacrifice in quality, but only for small dimensions d. Even 128 is already too large for LSH, so for vectors generated by transformer-based embedding models, the LSH index is not a common choice.
Example#
nbits = d * 8
index = faiss.IndexLSH(d, nbits)
index.train(data)
index.add(data)
D, I = index.search(data[:1], k)
print(f"closest elements: {I}")
print(f"distance: {D}")
closest elements: [[ 0 471 392]]
distance: [[ 0. 197. 199.]]
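To see the nbits tradeoff in practice, here is a small sketch of our own (not a Faiss utility) that builds LSH indexes with a few nbits values and measures recall against an exact flat index.
# exact results from a flat index as ground truth
flat = faiss.IndexFlatL2(d)
flat.add(data)
_, I_exact = flat.search(data[:100], k)
# compare LSH recall@k for increasing nbits
for nbits in (d, d * 4, d * 8):
    lsh = faiss.IndexLSH(d, nbits)
    lsh.train(data)
    lsh.add(data)
    _, I_lsh = lsh.search(data[:100], k)
    recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(I_lsh, I_exact)])
    print(f"nbits={nbits}, recall@{k}: {recall:.2f}")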