Case Study 2.2: How OpenAI and Facebook Speed Up Queries with Vector Databases

Reading Time: 7 mins

Why Vector Databases

Objective: Understand approximate indices for high-dimensional data

When you're dealing with high-dimensional data, finding approximate nearest neighbors (ANN) is essential. OpenAI, for instance, uses text embeddings to search billions of documents by semantic similarity. Traditional hashing methods simply don't make the cut.


Vector Databases: The Machine Behind Modern AI

Vector databases are engineered to handle high-dimensional vectors, like those produced by OpenAI's embedding models. They execute similarity searches efficiently in these high-dimensional spaces using ANN algorithms.

FAISS: Facebook's Workhorse

FAISS (Facebook AI Similarity Search) is the library of choice for fast similarity search and clustering of dense vectors.


Example: Navigating a 768-Dimensional Maze

Consider a text embedding as a 768-dimensional vector, courtesy of Sentence-BERT. You can experiment with your own embeddings in FAISS via this Colab.

Figure: Example of a 768-dimensional Sentence-BERT text embedding vector

Here's how it works: embed 1000 strings into FAISS's vector index, then query for three strings to find their approximate nearest neighbors.


Search Results:

Figure: FAISS nearest-neighbor search output

Notice how "String 42 is making me hungry" finds an exact match (Distance: 0.0) and also pulls in related strings from this 768-dimensional space. Meanwhile, "String 42" and "String 42 is making me thirsty" find similar strings, though not exact matches.


Locality Sensitive Hashing (LSH): The Speed Specialist

LSH offers a speed boost by trading off some accuracy. Test out Vector Search and LSH in this Hashing Colab.

In this setup, the IndexLSH object is trained on a dataset and used to find approximate nearest neighbors for a new query point.


The IndexLSH object, once trained, finds approximate nearest neighbors by hashing vectors into the same bucket when they fall on the same side of a set of random hyperplanes.
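The random-hyperplane idea can be sketched in a few lines of plain Python (a toy illustration of the hashing step, not how FAISS implements it internally):

```python
import random

random.seed(0)
d, nbits = 8, 16

# One random hyperplane (its normal vector) per hash bit.
planes = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(nbits)]

def lsh_hash(v):
    # Bit i is 1 when v lies on the positive side of hyperplane i,
    # so nearby vectors tend to share most of their bits.
    return tuple(int(sum(p_j * v_j for p_j, v_j in zip(p, v)) >= 0.0)
                 for p in planes)

v = [1.0, 0.5, -0.2, 0.9, 0.0, -1.0, 0.3, 0.7]
print(lsh_hash(v))  # a 16-bit binary signature for v
```

Because the bits depend only on which side of each hyperplane a vector falls, scaling a vector leaves its signature unchanged, and small perturbations usually flip few or no bits.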


Standard Hashing vs. LSH: A Quick Comparison

| Feature | Standard Hashing                          | Locality Sensitive Hashing                    |
|---------|-------------------------------------------|-----------------------------------------------|
| Goal    | Avoid collisions (unique buckets)         | Encourage collisions (similar items together) |
| Logic   | Small input change → massive hash change  | Small input change → small/no hash change     |
| Usage   | Exact lookups, join partitioning          | Similarity search, recommendation engines     |
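The "Logic" row is easy to see with a standard cryptographic hash, where a one-character change scrambles the entire digest (the strings here are made up; only the standard library is used):

```python
import hashlib

a = "String 42 is making me hungry"
b = "String 42 is making me hungry."  # one-character change

digest_a = hashlib.sha256(a.encode()).hexdigest()
digest_b = hashlib.sha256(b.encode()).hexdigest()

# Standard hashing: the two digests are unrelated despite near-identical
# inputs. This "avalanche" behavior is exactly what LSH deliberately gives up
# so that similar inputs land in the same bucket.
print(digest_a[:16])
print(digest_b[:16])
print(digest_a == digest_b)  # False
```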

Bottom Line: LSH adapts traditional indexing to manage the complexity of high-dimensional AI data, enabling us to search for "closeness" instead of mere "equality."