Vector databases are very useful for interacting with LLMs and building AI apps. But what are they? And why are they useful? I’ll answer these questions and more in this primer. Let’s dig in!
Before we start, we should quickly cover what a vector is in the context of language models. Language models unsurprisingly work with language, which consists of words, phrases, and sentences. But how can we get computers to understand language better? Enter vectors! Vectors represent pieces of language as high-dimensional numerical constructs. That’s a mouthful, so let’s explain through analogy.
From Words to Numbers
Imagine words are like colors. To represent colors on a computer, we often use numbers. For instance, the color red might be represented by the numbers [255, 0, 0]. Similarly, in language models, we represent words with numbers so that the computer can understand and work with them. These sets of numbers are what we call "vectors."
An example borrowed from OpenAI’s docs:
The piece of text:
Your text string goes here
Would look like the following as a vector:
[
-0.006929283495992422,
-0.005336422007530928,
...
-4.547132266452536e-05,
-0.024047505110502243
]
The ellipses omit many other numbers, since vectors can span thousands of dimensions (hence "high-dimensional"). As of this writing, OpenAI’s embedding models produce vectors with 1536 dimensions.
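To make this concrete, here’s roughly how you’d fetch such a vector, a minimal sketch assuming the openai Python package (v1+) and OpenAI’s embedding model as of this writing:

```python
# Minimal sketch: turning a text string into a vector with OpenAI's
# embeddings endpoint. Assumes the `openai` package (v1+) and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-ada-002",  # 1536-dimensional vectors
    input="Your text string goes here",
)

vector = response.data[0].embedding
print(len(vector))  # 1536
print(vector[:2])   # e.g. [-0.006929..., -0.005336...]
```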
Side note: Embeddings are specialized vectors that transform items, such as words or images, into numerical representations capturing context or meaning. While all embeddings are vectors, not all vectors are embeddings. However, you may hear "vectors" and "embeddings" used interchangeably in some discussions, which can blur the distinction between the two.
If you would like a further visual explanation of vectors/embeddings, Josh at Mythical AI does a great job walking through it.
TLDR: I think of vectors as a weighting of different “concepts” in a chunk of text, in the same way RGB is a weighting of a color as it relates to the primary colors.
Vector DBs
Now that we understand vectors, let’s dive into how to work with them at scale with an example. Suppose you have a repository with tons of text and documents, like your journal entries for the last 20 years.
Now suppose you want to find, for any given entry, the most related entries. For example, if I have an entry that describes my excitement about a new tech thing, like LLMs, I’d want to surface entries that discuss similar topics, like other tech happenings. We can accomplish this by converting all entries into vectors¹ and then comparing the vectors to each other using a measure like cosine similarity. I’ll spare you the math refresher; essentially, these measures compare vectors dimension by dimension and boil the comparison down to a single score, and that score tells us how similar the vectors are.²
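As a quick illustration, here’s cosine similarity in a few lines of numpy. The three-dimensional "embeddings" are made up for readability; real ones have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the vectors, normalized by their magnitudes.
    # Ranges from -1 (opposite) to 1 (same direction).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" standing in for real ones.
entry_about_llms = np.array([0.9, 0.1, 0.2])
entry_about_tech = np.array([0.8, 0.2, 0.3])
entry_about_pets = np.array([0.1, 0.9, 0.1])

print(cosine_similarity(entry_about_llms, entry_about_tech))  # high, ~0.98
print(cosine_similarity(entry_about_llms, entry_about_pets))  # lower, ~0.24
```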
If we did this naively, and if the corpus were sufficiently large, we’d run into a problem. While computing the similarity between any two vectors is fast, to find the closest journal entry we have to compare one vector against all the others. Furthermore, we want to rank all entries by similarity; it’s not enough to know the distance between one pair of vectors; we want an ordered list.
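Here’s what that naive approach looks like in numpy (a sketch; the random matrix stands in for real embeddings). It works, but every query pays for a full pass over the corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus: 10,000 entries, each a 1536-dimensional embedding,
# normalized to unit length so a dot product equals cosine similarity.
corpus = rng.normal(size=(10_000, 1536))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.normal(size=1536)
query /= np.linalg.norm(query)

# One matrix-vector product scores the query against every entry...
scores = corpus @ query

# ...and a sort produces the ordered list. That's O(n) comparisons
# per query, which is exactly what hurts at scale.
ranked = np.argsort(-scores)
print(ranked[:5])           # indices of the five most similar entries
print(scores[ranked[:5]])   # their similarity scores
```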
For an application where the user experience is sensitive to latency, we need speeeed when performing this calculation. Enter vector databases! Vector databases (also known as vector stores in some contexts) are specialized databases designed to handle and process vectors efficiently, typically by building approximate nearest neighbor (ANN) indexes so that a query doesn’t have to be compared against every vector. They enable fast similarity search.
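To show the idea behind those indexes, here’s a minimal sketch using the hnswlib library (my choice for illustration, not something the post prescribes), again with random stand-in data:

```python
# Sketch of approximate nearest neighbor (ANN) search with hnswlib,
# one of many ANN libraries; vector DBs build on similar structures.
import hnswlib
import numpy as np

rng = np.random.default_rng(0)
dim, n = 1536, 10_000
data = rng.normal(size=(n, dim)).astype(np.float32)

# Build an HNSW index over the corpus once, up front.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

# Queries traverse the index graph instead of scanning all n vectors.
query = rng.normal(size=(1, dim)).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels[0])     # ids of the 5 approximate nearest neighbors
print(distances[0])  # cosine distances (1 - similarity)
```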
Side note: Database specialization is not new. We have caches like Redis for fast key-value querying, we have relational databases for efficient structured data indexing and querying, and we have analytics databases (OLAP) like BigQuery to make querying vast quantities of data fast.
There are many new companies and products in this space, including Pinecone, Chroma, and Weaviate. It’s also worth noting that you might not need a new database to unlock vector search. Mainstream databases have vector support, including Postgres (via the pgvector extension), Elasticsearch, and Redis. You can even just use numpy in Python!
In a sense, “database” is a misnomer, because the appropriate solution for a particular use case may not require a separate or new service with its own persistence. For small projects, or projects that are batch-oriented, an in-process solution like numpy may be enough. For business applications, where the dataset is likely larger and likely requires some real-time querying, a more specialized solution may be more appropriate. So, like all good engineering solutions: “it depends”.
So what?
Vector DBs are useful in LLM apps because they help manage the context window token limit.
LLMs have a limited context window for input and output tokens.³ The base GPT-4 model has a limit of roughly 8,000 tokens, so it is often impractical to fit all the context related to a query or search into a single prompt. For example, if you wanted to ask a question about your company’s documentation, the input would likely exceed the context window token limit.

With a vector DB, a corpus like a knowledge base can be indexed in chunks that are small enough for later retrieval by an LLM. Chunking is a separate topic altogether, but for the documentation example we could take all the articles and break them up into 500-token chunks that are then indexed in a vector DB. The 500-token chunk size is arbitrary, and in this example it would allow the retrieval of multiple chunks for later consumption by the LLM.

For a particular user query, we would embed the query, turning it into a vector, and then perform similarity search as previously described. The similarity search results are an ordered list of the chunks most relevant to the query. Altogether, this is one way to perform retrieval augmented generation (RAG), a powerful technique for leveraging LLMs.
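To tie the pieces together, here’s a sketch of that chunk-then-retrieve flow. It assumes the openai and tiktoken packages; the embed() helper, the placeholder articles, and the example query are illustrative stand-ins, not a prescribed implementation:

```python
import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks of at most max_tokens tokens."""
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts and L2-normalize, so dot product = cosine similarity."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    vectors = np.array([d.embedding for d in resp.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# "Index" the documentation: chunk every article, embed every chunk.
articles = ["...article one...", "...article two..."]  # your docs here
chunks = [c for article in articles for c in chunk(article)]
index = embed(chunks)

# At query time: embed the question, rank chunks, and hand the top few
# to the LLM as context. That's the "retrieval" in RAG.
query_vector = embed(["How do I rotate my API key?"])[0]
top_k = np.argsort(-(index @ query_vector))[:3]
context = "\n\n".join(chunks[i] for i in top_k)
```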
Conclusion
Those are the basics of vectors, vector DBs, and how to use them with LLMs. Vectors/embeddings are extremely useful for building a production-grade LLM-based app.
However, there’s so much more to explore. Here are a few additional considerations:
Your choice of vector DB can be influenced by ancillary features, like metadata querying support (e.g. “give me all documents created between Jan 2021 and Jan 2023”) and integrations with APIs and models. For example, some vector DBs integrate directly with OpenAI APIs, so developers do not need to write glue scripts to generate vectors/embeddings themselves.
Semantic search is often not enough for great search. Keyword/traditional full-text search is still useful to augment semantic search. Combining semantic search with traditional search creates another challenge: how do you optimally balance the two? Thankfully there are a few solutions popping up; one simple, widely used approach is sketched below.
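One such approach is reciprocal rank fusion (RRF), which merges ranked lists using only each document’s rank in each list, sidestepping the fact that keyword scores and cosine similarities live on different scales. A minimal sketch (the doc IDs are made up):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs. Each doc scores sum(1 / (k + rank));
    k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Keyword search and semantic search each return their own ordering...
keyword_results = ["doc_a", "doc_b", "doc_c"]
semantic_results = ["doc_c", "doc_a", "doc_d"]

# ...and RRF blends them into one list without comparing raw scores.
print(reciprocal_rank_fusion([keyword_results, semantic_results]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```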
I hope you found this primer helpful!
¹ There is more engineering to be done here, since entries may be too long for an embedding model and thus require chunking.
² Cosine similarity is not the only metric we can use to compare vectors, but it is a very common one.
³ LLMs with very large context windows are also not clearly a panacea, since retrieval performance appears to suffer with the current generation of LLMs.