Introduction
Embeddings are a machine learning technique for converting text and image data into arrays of numbers that computers can work with. In this post we are going to talk about text embeddings. Text embeddings are produced by large deep learning models; they capture the semantic meaning of language and can be used in applications like search, clustering, recommendations, etc.
There are many articles online that cover the technical details of embeddings. In this article we will talk about how to use them in practice. The next set of articles will cover the search, storage, and querying of embeddings.
An example of a word embedding.
As mentioned above, a word embedding is an array of numbers. Here's an example from a model that produces an array of 5 numbers:
embedding("cat") = [0.0169, -0.0764, 0.0334, 0.0157, -0.0043]
A model that generates an array of 5 numbers is said to produce a 5-dimensional vector.
Using ML models to generate word embeddings.
Embedding models are trained using deep learning to capture the semantic meaning of a language. There are many models to choose from. In this post we will look at 2 popular embedding models: OpenAI's text-embedding-ada-002 and gte-base from the GTE family of models.
Using OpenAI’s text-embedding-ada-002
OpenAI offers text-embedding-ada-002 as its embedding model. It produces a vector with 1536 dimensions, which makes it one of the largest in the industry.
The model can generate embeddings for single words as well as whole sentences. Sentences can be of arbitrary length up to the model's token limit, but generally the shorter the text, the better the information density.
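If you want to check how many tokens a piece of text consumes before sending it to the API, here is a minimal sketch using the tiktoken library (the exact counts depend on the tokenizer):

import tiktoken

# encoding_for_model resolves the tokenizer used by the given model
encoding = tiktoken.encoding_for_model("text-embedding-ada-002")

def num_tokens(text: str) -> int:
    # Tokenize the text and count the resulting tokens
    return len(encoding.encode(text))

print(num_tokens("The cat sat on the mat"))  # a handful of tokens, well under the limit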
OpenAI exposes embeddings through its embeddings API /v1/embeddings and through its SDKs for Python, JS, and other programming languages. In this post, we will use Python to generate embeddings.
Installation
pip install "openai<1.0" # the snippets below use the pre-1.0 SDK interface (openai.Embedding.create)
Usage
You need a valid OpenAI API key to call the API. You can generate a new API key in the OpenAI dashboard.
Once the API key is available, set the environment variable OPENAI_API_KEY or pass the API key to every function call like so:
openai.Embedding.create(model=model, input=text, api_key="your-api-key")
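Alternatively, a minimal sketch of setting the key once at startup with the pre-1.0 Python SDK, instead of passing it on every call:

import os
import openai

# The pre-1.0 SDK also reads OPENAI_API_KEY from the environment automatically
openai.api_key = os.environ["OPENAI_API_KEY"]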
We can create a function that can be used to retrieve embeddings for a single sentence or a list of sentences:
import openai
from typing import List

def get_embedding(
    text: str, model_name: str = "text-embedding-ada-002"
) -> List[float]:
    # Embed a single piece of text and return its vector
    response = openai.Embedding.create(model=model_name, input=text)
    return response["data"][0]["embedding"]

def get_embeddings(
    texts: List[str], model_name: str = "text-embedding-ada-002"
) -> List[List[float]]:
    BATCH_SIZE = 2000  # Set it to a value less than 2048
    # Split the input texts into batches
    batches = [
        texts[i : i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)
    ]
    embeddings = []
    for batch in batches:
        response = openai.Embedding.create(model=model_name, input=batch)
        embeddings.extend([e["embedding"] for e in response["data"]])
    return embeddings
cat_embeddings = get_embedding(text="cat")
animals_embeddings = get_embeddings(texts=["cat", "dog", "horse"])
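A quick sanity check on the shapes (the dimension comes from the model, 1536 for ada-002):

print(len(cat_embeddings))         # 1536: one vector with 1536 dimensions
print(len(animals_embeddings))     # 3: one vector per input text
print(len(animals_embeddings[0]))  # 1536: each vector has the same dimension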
Pro tip: We can parallelize the get_embeddings function to retrieve embeddings for multiple batches in parallel. The code example is shown at the end of the post.
Using the gte-base model from the GTE family of models
Although OpenAI offers a cost-effective model with no overhead of maintaining the model ourselves, one problem with OpenAI's embeddings is their dimension size: at 1536 dimensions they are among the largest in the industry, and this size has performance implications for storing and searching the vectors.
I came across a blog post from Supabase, Fewer dimensions are better, which talks about alternatives to the Ada embedding model and the trade-offs involved. One thing that stood out is the GTE family of embedding models: they offer lower-dimensional vectors with better benchmark performance than the Ada model, so let's explore them further.
GTE stands for General Text Embeddings. Here's a description of the model from Hugging Face:
The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.
Installation
pip install sentence-transformers
Usage
With the sentence-transformers library, it's easy to get embeddings from the model; the model is downloaded from the Hugging Face Hub and cached locally on your device.
from typing import List

from sentence_transformers import SentenceTransformer

def get_embedding(model: SentenceTransformer, text: str) -> List[float]:
    # Encode a single piece of text into a vector
    return model.encode(text).tolist()

def get_embeddings(model: SentenceTransformer, texts: List[str]) -> List[List[float]]:
    # Encode a list of texts; returns one vector per text
    return model.encode(texts).tolist()
model = SentenceTransformer("thenlper/gte-base")
cat_embeddings = get_embedding(model=model, text="cat")
animals_embeddings = get_embeddings(model=model, texts=["cat", "dog", "horse"])
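The same sanity check as before, this time with gte-base's smaller vectors:

print(len(cat_embeddings))      # 768: gte-base produces 768-dimensional vectors
print(len(animals_embeddings))  # 3: one vector per input text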
Pro tip: GTE embedding computation is a CPU-intensive process; we can parallelize it across multiple CPU cores (or even use GPUs). I have provided an example that parallelizes the workload across multiple CPU cores at the end of the post.
Comparison
Model performance
The MTEB benchmark compares various models on tasks such as search, clustering, etc. and publishes the results. I have summarized the 3 models here; for more details, please visit the MTEB leaderboard.
| Model Name | Dimensions | Average accuracy on various tasks |
|---|---|---|
| text-embedding-ada-002 | 1536 | 60.99 |
| gte-base | 768 | 62.39 |
| gte-small | 384 | 61.36 |
Both gte-base and gte-small provide better performance on average than the Ada embedding model, while being much smaller in size.
Cosine similarity
Cosine similarity measures the similarity between two word embeddings. In simple terms, a higher score corresponds to greater similarity, while a lower score indicates dissimilarity.
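To make this concrete, here is a minimal sketch of computing cosine similarity with NumPy; the vectors can come from either model above:

from typing import List

import numpy as np

def cosine_similarity(a: List[float], b: List[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    a_vec, b_vec = np.asarray(a), np.asarray(b)
    return float(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))

print(cosine_similarity([0.1, 0.2, 0.3], [0.1, 0.2, 0.25]))  # ~1.0 for similar vectors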
We can visualize the cosine similarity of the 6 sample words using a heatmap. In this heatmap, darker shades of blue represent higher scores and thus more similar words, whereas lighter shades of blue signify more distant or dissimilar words.
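Here is a sketch of how such a heatmap can be produced; the word list and plotting choices are illustrative, and get_embeddings refers to the GTE helper defined above:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

words = ["apple", "orange", "river", "mountain", "cat", "dog"]  # illustrative sample words
vectors = np.array(get_embeddings(model=model, texts=words))

# Normalizing the rows makes the matrix product equal to pairwise cosine similarity
normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = normed @ normed.T

sns.heatmap(similarity, xticklabels=words, yticklabels=words, cmap="Blues")
plt.show()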
OpenAI embeddings
The text-embedding-ada-002 model produces word embeddings where similar words (such as Apple and Orange) have a higher cosine similarity than dissimilar words (such as River and Apple).
GTE Embeddings
The heatmap for the gte-base model looks similar to that of the text-embedding-ada-002 model.
Conclusion
Both proprietary and open-source models can be used to generate embeddings. In this post, we compared two models and observed that they produce similar results. In terms of performance, the GTE family of models ranks better. The GTE models also produce vectors with much smaller dimensions, which is a significant advantage when using embeddings in production use cases, such as search and clustering.
Pro tips
Parallelizing OpenAI’s get_embeddings function
openai_embedding.py
import concurrent.futures
from typing import List

import openai
import tiktoken

class OpenAIEmbedding:
    def __init__(
        self, model_name: str = "text-embedding-ada-002", api_key: str = ""
    ) -> None:
        self.model_name = model_name
        self.encoding = tiktoken.encoding_for_model(model_name=model_name)
        self.api_key = api_key

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        print("Getting embeddings for texts...")
        return self.__get_embeddings_parallel(texts=texts)

    def __get_embeddings_for_batch(self, batch: List[str]) -> List[List[float]]:
        """Helper function to fetch embeddings for a batch."""
        response = openai.Embedding.create(
            model=self.model_name, input=batch, api_key=self.api_key
        )
        return [e["embedding"] for e in response["data"]]

    def __get_embeddings_parallel(self, texts: List[str]) -> List[List[float]]:
        BATCH_SIZE = 2000
        batches = [texts[i : i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
        embeddings = []
        # Use ThreadPoolExecutor to fetch data for each batch in parallel
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            for batch_result in executor.map(self.__get_embeddings_for_batch, batches):
                embeddings.extend(batch_result)
        return embeddings
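A quick usage sketch of the class above (the API key is a placeholder):

embedder = OpenAIEmbedding(api_key="your-api-key")
embeddings = embedder.get_embeddings(texts=["cat", "dog", "horse"] * 1000)
print(len(embeddings))  # one vector per input text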
Parallelizing GTE’s get_embeddings function
gte_embedding.py
from multiprocessing import Pool, cpu_count
from typing import List

from sentence_transformers import SentenceTransformer

class GTEEmbedding:
    def __init__(self, model_name: str = "thenlper/gte-base") -> None:
        self.model = SentenceTransformer(model_name)

    def _encode_batch(self, batch: List[str]) -> List[List[float]]:
        # Helper function to encode a batch of texts
        print(f"Encoding batch of {len(batch)} texts...")
        return self.model.encode(batch).tolist()

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        batch_size = 1000
        batches = [
            texts[i : i + batch_size] for i in range(0, len(texts), batch_size)
        ]
        # Limit the workers to 5 or available CPUs, whichever is less
        max_workers = min(5, cpu_count())
        with Pool(max_workers) as pool:
            results = pool.map(self._encode_batch, batches)
        # Flatten the per-batch results into a single list of vectors
        print("Flattening results...")
        embeddings = [
            embedding for batch_result in results for embedding in batch_result
        ]
        return embeddings
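A quick usage sketch; the __main__ guard matters here, because multiprocessing workers re-import the module:

if __name__ == "__main__":
    embedder = GTEEmbedding()
    embeddings = embedder.get_embeddings(texts=["cat", "dog", "horse"] * 1000)
    print(len(embeddings))  # 3000 vectors, one per input text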