Case Study 2.2: How OpenAI and Facebook Speed Up Queries with Vector Databases
Why Vector Databases
Objective: Understand approximate indices for high-dimensional data
When you're dealing with high-dimensional data, finding approximate nearest neighbors (ANN) is essential. OpenAI, for instance, uses text embeddings to search semantically across billions of documents. Traditional hashing methods simply don't make the cut: they are designed to scatter similar inputs into unrelated buckets, the opposite of what similarity search needs.
Vector Databases: The Machine Behind Modern AI
Vector databases are engineered to store and index high-dimensional vectors, like those from OpenAI's embedding models. They efficiently execute similarity searches in these high-dimensional spaces using ANN algorithms.
FAISS: Facebook's Workhorse
FAISS (Facebook AI Similarity Search) is Facebook's library of choice for fast similarity search and clustering of dense vectors.
- Hardware: it can offload indexing and searching to GPUs, workloads that would otherwise overwhelm a CPU.
Example: Navigating a 768-Dimensional Maze
Consider a text embedding as a 768-dimensional vector, courtesy of Sentence-BERT. You can experiment with your own embeddings in FAISS via this Colab.
Here's how it works: embed 1000 strings into a FAISS vector index, then query it with three strings to find their approximate nearest neighbors.
Search Results:
Notice how "String 42 is making me hungry" finds an exact match (Distance: 0.0). It also pulls in related strings in this 768-dimensional space. Meanwhile, "String 42" and "String 42 is making me thirsty" find similar strings, though not exact matches.
Locality Sensitive Hashing (LSH): The Speed Specialist
LSH offers a speed boost by trading off some accuracy. Test out Vector Search and LSH in this Hashing Colab.
In this setup, the IndexLSH object is trained on a dataset and used to find approximate nearest neighbors for a new query point.
Once trained, the IndexLSH object finds approximate nearest neighbors by hashing each vector according to which side of a set of random hyperplanes it falls on; vectors that land on the same side of the same planes collide into the same bucket.
Standard Hashing vs. LSH: A Quick Comparison
| Feature | Standard Hashing | Locality Sensitive Hashing |
|---|---|---|
| Goal | Avoid collisions (Unique buckets) | Encourage collisions (Similar items together) |
| Logic | Small input change → Massive hash change | Small input change → Small/No hash change |
| Usage | Exact lookups, Join partitioning | Similarity search, Recommendation engines |
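The contrast in the table is easy to demonstrate. In the toy sketch below, a cryptographic digest stands in for standard hashing, and a hand-rolled random-hyperplane hash stands in for LSH (`std_hash`, `lsh_hash`, and all sizes are hypothetical helpers, not library APIs):

```python
import hashlib
import numpy as np

def std_hash(v: np.ndarray) -> str:
    """Standard hashing: any change to the input yields an unrelated digest."""
    return hashlib.sha256(v.tobytes()).hexdigest()[:12]

def lsh_hash(v: np.ndarray, planes: np.ndarray) -> str:
    """Random-hyperplane LSH: one bit per plane; nearby vectors share most bits."""
    return "".join("1" if dot > 0 else "0" for dot in planes @ v)

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 4))          # 16 random hyperplanes in 4-D
v = np.array([1.0, 2.0, 3.0, 4.0])
v_near = v + 1e-6                              # tiny perturbation

print(std_hash(v), std_hash(v_near))           # completely different digests
print(lsh_hash(v, planes), lsh_hash(v_near, planes))  # (near-)identical bit strings
```

The small input change scrambles the SHA-256 digest entirely, while the LSH bit strings agree on essentially every bit — exactly the collision behavior the table describes.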
Bottom Line: LSH adapts traditional indexing to manage the complexity of high-dimensional AI data, enabling us to search for "closeness" instead of mere "equality."