How Vector Databases Became a Multi-Million Dollar Market
Vector databases: the cost-saver for LLM data augmentation.
The typical newsletter at AIport covers the latest AI news and updates.
Today, we are beginning with a new type of article about recent market trends in AI.
We decided to begin this series with vector databases and how they grew into a multi-million-dollar market in such a short time.
Let us know if you liked it and want more of this type of content.
Vector databases are not new.
They have existed for a long time and have been at the core of recommender systems and search engines, which we use almost daily.
But have a look at these Google search trends for the topic “Vector database” over the last five years:
A technology that was of almost no interest to anyone has exploded in popularity over the last year or so.
In fact, companies that didn't exist two years ago have raised millions to build dedicated vector databases or to augment traditional databases (SQL or NoSQL) with vector search capabilities.
These are some dedicated vector databases that have gained traction lately:
Pinecone raised $138M.
Chroma DB raised $20M.
Qdrant raised $37.8M.
Weaviate raised $67.7M.
And there are many more…
So, what made vector databases so useful almost overnight? Let’s find out!
The problem: unstructured data
Unstructured data is everywhere.
The voice notes you receive → Unstructured data.
The images in your phone → Unstructured data.
The emails in your inbox → Unstructured data.
The lines in this article → Unstructured data.
The videos you watch → Unstructured data.
This data is important, of course. Being able to query this data to extract information, just like we query a traditional SQL database, would be great.
Sample query: “From a library of photos, select all photos with a mountain.”
However, storing this data in a traditional DBMS is difficult because it is designed to store structured data with well-defined schemas.
Simply put, how would we even define columns to store audio/video/image/text?
There’s one more issue. Consider the sample query written above. There are possibly 20+ ways of writing this query:
Pictures with mountains.
Photos with mountains.
Pics with a mountain.
Mountain pictures.
Mountain pics.
and more…
In other words, there's no standard syntax (like SQL's SELECT * FROM table WHERE condition) to query such data.
The solution: vectors
One possible solution was to encode unstructured data into high-dimensional vectors (using ML techniques) and store those vectors in table format (we are simplifying a bit here).
This numerical vector is called an embedding, and it captures the features and characteristics of the underlying data being embedded.
Consider word embeddings, for instance.
When we analyze them, the representations of fruits are found close to each other, cities form another cluster, and so on.
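To make this concrete, here is a minimal sketch using the sentence-transformers library (the model name is just one popular choice, and the word list is ours for illustration):

```python
# Minimal embedding demo; assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["apple", "banana", "mango", "paris", "tokyo", "berlin"]

# Each word becomes a high-dimensional vector (384 dims for this model).
vecs = model.encode(words, normalize_embeddings=True)

# On unit-length vectors, cosine similarity is just a dot product.
sims = vecs @ vecs.T
print(np.round(sims, 2))
# Fruits score noticeably higher against other fruits than against
# cities, and vice versa: the two clusters described above.
```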
A database specifically designed to store vectors is a vector database (let’s put a pin in that).
This solves both problems we discussed earlier:
We found a way to store unstructured data ✅
Embeddings can handle linguistic diversity at query time ✅
Vector databases with LLMs
Vector databases are widely used in conjunction with LLMs.
They are one of the most inexpensive ways to make an LLM practically useful, as they allow the model to generate reasonably reliable text based on information it was never trained on.
But how?
Some background details
LLMs can’t be trained on every piece of information in the world.
Consider LLaMA 2 — an open-source LLM by Meta.
When they trained it, they did not have access to:
Information generated after the data snapshot date.
The internal docs of your company.
Your private datasets.
And more.
For instance, if the model was trained on data before 31st Jan 2024 (the snapshot date), it will have no clue what happened after that.
Worse, the model may confidently hallucinate an answer instead of admitting it does not know.
To solve this:
You could continue to train the model on incoming data or fine-tune it on the internal dataset. But this is challenging because these models are hundreds of GB in size.
Or you could use a vector database, which eliminates the need for fine-tuning.
Here’s how it works:
Encode the additional data and store its vectors in a vector database.
When the user prompts the LLM, query the vector database to retrieve vectors similar to the prompt's vector. A similarity search (e.g., cosine similarity or approximate nearest neighbors) is used here.
Pass the retrieved information along with the user’s prompt to the LLM.
Done!
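Here is a minimal sketch of that three-step flow, using sentence-transformers for the embeddings and an in-memory Qdrant instance as the vector database. The documents, collection name, and ask_llm() stub are made up for illustration; treat this as a sketch, not a reference implementation:

```python
# RAG sketch; assumes `pip install sentence-transformers qdrant-client`.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# 1) Encode the additional data and store its vectors.
docs = [
    "Our Q3 revenue grew 12% quarter-over-quarter.",
    "The on-call rotation switches every Monday at 9 am.",
    "All expense reports are due by the 5th of each month.",
]
client = QdrantClient(":memory:")  # in-process instance, for demo purposes
client.create_collection(
    collection_name="internal_docs",
    vectors_config=VectorParams(
        size=encoder.get_sentence_embedding_dimension(),
        distance=Distance.COSINE,
    ),
)
client.upsert(
    collection_name="internal_docs",
    points=[
        PointStruct(id=i, vector=encoder.encode(d).tolist(), payload={"text": d})
        for i, d in enumerate(docs)
    ],
)

# 2) Retrieve vectors similar to the prompt's vector.
prompt = "When do expense reports need to be submitted?"
hits = client.search(
    collection_name="internal_docs",
    query_vector=encoder.encode(prompt).tolist(),
    limit=2,
)
context = "\n".join(h.payload["text"] for h in hits)

# 3) Pass the retrieved information along with the user's prompt to the LLM.
augmented_prompt = f"Context:\n{context}\n\nQuestion: {prompt}"
# answer = ask_llm(augmented_prompt)  # ask_llm() is a placeholder for any LLM call
```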
As a result, the model now gets access to information it was never trained on.
The idea is called Retrieval-Augmented Generation (RAG).
By injecting relevant details into the prompt, you can make the LLM generate more precise answers even if it was not explicitly trained on that data.
Now you can see why vector databases have exploded in recent years.
In essence, a vector database makes the LLM much more "real-time" in nature: ideally, you want to interact with data generated even seconds after the snapshot date.
This is not possible with fine-tuning, which will never be "real-time."
But when using a vector database, all we have to do is dump the incoming data into it, and query it when needed.
Investing just a few hundred dollars a month in a vector database can save tens of thousands of dollars in fine-tuning costs.
Since businesses love to save (or make) more money, they saw this cost-saving opportunity and ran with it.
Demand surged, and with it came new companies valued at hundreds of millions of dollars.
Concluding remark
Of course, vector databases are great!
But like any other technology, they do have pros and cons.
Just because vector databases sound cool does not mean you have to adopt them everywhere you need to query vectors.
For small-scale applications with a limited number of vectors, a simpler solution, such as a NumPy array with an exhaustive search, will suffice.
There's no need to move to a vector database unless you see clear benefits, such as lower latency or reduced costs.
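For instance, a complete exhaustive search fits in a few lines of NumPy (a sketch; the embeddings here are random stand-ins for real ones):

```python
# Exhaustive nearest-neighbor search in plain NumPy: fine for small corpora.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 384))            # stand-in embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.normal(size=384)
query /= np.linalg.norm(query)

scores = corpus @ query                            # cosine similarity
top_k = np.argsort(scores)[-5:][::-1]              # indices of the 5 best matches
print(top_k, scores[top_k])
```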
Thanks for reading!