Traditionally, computing has been deterministic: outcomes are consistent, repeatable, and provable because the output strictly follows the program logic that software developers wrote.
LLMs, by contrast, leverage similarity search to process information. During the training phase, an LLM learns patterns of similarity among text tokens and encodes those patterns in an extensive neural network. The patterns are represented as points in a high-dimensional vector space, which allows a more nuanced understanding of textual data.
When processing user input, the input sentence is first converted into a vector. OpenAI's software then searches for the nearest tokens in that high-dimensional space, using the shortest distance as the measure of similarity. No developer writes separate logic for each case; every sentence goes through the same vector-search process.
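To make distance-based similarity concrete, here is a minimal sketch using toy 3-dimensional vectors (the word labels and vector values are invented for illustration; real embeddings have hundreds of dimensions). Cosine similarity approaches 1.0 as two vectors point in the same direction:

>>> import numpy as np
>>> # toy embeddings, invented for illustration only
>>> cat = np.array([1.0, 0.9, 0.1])
>>> kitten = np.array([0.9, 1.0, 0.2])
>>> car = np.array([0.1, 0.2, 1.0])
>>> def cosine(a, b):
...     return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
...
>>> print(round(float(cosine(cat, kitten)), 3))  # similar words, close vectors
0.992
>>> print(round(float(cosine(cat, car)), 3))     # unrelated words, distant vectors
0.275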
Below I show how a sentence is transformed into a vector. The model used in this example (all-MiniLM-L6-v2) produces 384-dimensional embeddings. OpenAI offers its own API endpoint that does similar processing (text-embedding-ada-002) with 1536 dimensions.
>>> from sentence_transformers import SentenceTransformer
>>> sentences = ["We are at the dawn of a new era...", "Each sentence is converted"]
>>>
>>> model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
>>> embeddings = model.encode(sentences)
>>> print(embeddings)
[[-9.31226462e-03 -5.73424622e-03  2.18255073e-02 -8.62214249e-03
  -1.94084086e-02  2.32371558e-02 -3.61166969e-02 -5.94260842e-02
   8.96405503e-02  8.60120635e-03 ...]
 [...]]
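Continuing the session, the shape of the result confirms the 384-dimensional output:

>>> print(embeddings.shape)
(2, 384)

For the OpenAI endpoint mentioned above, a roughly equivalent sketch looks like this. It assumes the pre-1.0 openai Python package and an API key set in the OPENAI_API_KEY environment variable; see the OpenAI embedding doc in the references:

>>> import openai
>>> response = openai.Embedding.create(
...     model="text-embedding-ada-002",
...     input="We are at the dawn of a new era..."
... )
>>> print(len(response["data"][0]["embedding"]))  # ada-002 returns 1536 dimensions
1536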
References:
- OpenAI, "New and improved embedding model": https://openai.com/blog/new-and-improved-embedding-model
- Pervasive Technology Institute at Indiana University, "Introduction of document similarity" (video)