
Understanding Embeddings

Posted on: October 7, 2023 at 03:00 PM


Introduction

Embeddings are a machine learning technique for converting text and image data into arrays of numbers that computers can work with. In this post we are going to talk about text embeddings. Text embeddings are produced by large deep learning models, they capture the semantic meaning of language, and they can be used in applications like search, clustering, recommendations, etc.

There are many articles online that cover the technical details of embeddings. In this article we will focus on how to use them in practice. The next set of articles will cover searching, storing, and querying embeddings.

An example of a word embedding.

As mentioned above, a word embedding is an array of numbers. Here’s an example from an embedding model that produces an array of 5 numbers:

embedding("cat") = [0.0169, -0.0764, 0.0334, 0.0157, -0.0043]

A model that generates an array of 5 numbers is said to produce a 5-dimensional vector.

Using ML models to generate word embeddings.

Embedding models are deep learning models trained to capture the semantic meaning of a language. There are many models available to choose from. In this post we will look at 2 popular embedding models: OpenAI’s text-embedding-ada-002 and gte-base from the GTE family of models.

Using OpenAI’s text-embedding-ada-002

OpenAI offers text-embedding-ada-002 as its embedding model. This model produces a vector with 1536 dimensions, which makes it one of the largest in the industry.

The model is capable of generating embeddings for both single words and sentences. The sentences can be of any length up to the model’s input limit of 8,191 tokens, but generally the shorter the text, the better the information density.
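Before sending long text to the API, you can check how many tokens it uses with the tiktoken library. A quick sketch (tiktoken is installed separately with pip install tiktoken):

import tiktoken

# Tokenizer matching text-embedding-ada-002
encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
num_tokens = len(encoding.encode("The quick brown fox jumps over the lazy dog"))
print(num_tokens)  # must stay within the model's input token limit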

OpenAI exposes embeddings through its embeddings API (/v1/embeddings) and through its SDKs for Python, JS, and other programming languages. In this post, we will use Python to generate embeddings.

Installation

pip install openai

Usage

You need a valid OpenAI API key to call the API. You can generate a new API key in OpenAI’s dashboard. Once the API key is available, set the environment variable OPENAI_API_KEY, or pass the API key to every function call like so:

openai.Embedding.create(model=model, input=text, api_key="your-api-key")
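For example, a minimal sketch of setting the key once instead of passing it on every call (the placeholder key is illustrative):

import os
import openai

# Option 1: read the key from the environment
# (the SDK also picks up OPENAI_API_KEY automatically if it is set)
openai.api_key = os.environ["OPENAI_API_KEY"]

# Option 2: set it directly (illustrative placeholder, not a real key)
# openai.api_key = "your-api-key"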

We can create helper functions to retrieve embeddings for a single sentence or for a list of sentences:

import openai
from typing import List


def get_embedding(
    text: str, model_name: str = "text-embedding-ada-002"
) -> List[float]:
    # Embed a single piece of text and return its vector
    response = openai.Embedding.create(model=model_name, input=text)
    return response["data"][0]["embedding"]


def get_embeddings(
    texts: List[str], model_name: str = "text-embedding-ada-002"
) -> List[List[float]]:
    BATCH_SIZE = 2000  # Set it to a value less than 2048

    # Create batches
    batches = [
        texts[i : i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)
    ]
    embeddings = []
    for batch in batches:
        response = openai.Embedding.create(model=model_name, input=batch)
        embeddings.extend([e["embedding"] for e in response["data"]])

    return embeddings

cat_embeddings = get_embedding(text="cat")
animals_embeddings = get_embeddings(texts=["cat", "dog", "horse"])

Pro tip: We can parallelize the get_embeddings function to retrieve embeddings for multiple batches in parallel. The code example is shown in the Pro tips section at the end of the post.

Using the gte-base model from the GTE family of models

Although OpenAI offers a cost-effective model with no overhead of maintaining the model ourselves, one problem with OpenAI’s embeddings is the dimension size of 1536, which makes them among the largest in the industry. This size has performance implications for storage and similarity search.
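To put that in perspective, here’s a rough back-of-envelope sketch, assuming embeddings are stored as float32 (4 bytes per dimension):

num_vectors = 1_000_000  # e.g. 1 million documents
for dims in (1536, 768, 384):
    size_gb = num_vectors * dims * 4 / 1e9  # 4 bytes per float32 dimension
    print(f"{dims} dims -> ~{size_gb:.1f} GB for {num_vectors:,} vectors")

# 1536 dims -> ~6.1 GB, 768 dims -> ~3.1 GB, 384 dims -> ~1.5 GB

Halving the dimension count roughly halves the storage (and the work per similarity comparison), which matters once you index millions of vectors.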

I came across a blog post from Supabase, Fewer dimensions are better, which discusses alternatives to the Ada embedding model and the trade-offs involved. One thing that stood out was the GTE embedding models: they offer lower-dimensional vectors with better benchmark performance than the Ada model. Let’s explore them further.

GTE stands for General Text Embeddings. Here’s a description of the models from Hugging Face:

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.

Installation

pip install sentence-transformers

Usage

Using the sentence-transformers library, it’s easy to get embeddings from the model; the model is downloaded from the Hugging Face Hub and stored locally on your device.

from typing import List

from sentence_transformers import SentenceTransformer


def get_embedding(model: SentenceTransformer, text: str) -> List[float]:
    # Encode a single piece of text and return its vector
    return model.encode(text).tolist()


def get_embeddings(model: SentenceTransformer, texts: List[str]) -> List[List[float]]:
    # Encode a list of texts and return one vector per text
    return model.encode(texts).tolist()

model = SentenceTransformer("thenlper/gte-base")

cat_embeddings = get_embedding(model=model, text="cat")
animals_embeddings = get_embeddings(model=model, texts=["cat", "dog", "horse"])

Pro tip: Computing GTE embeddings is CPU intensive; we can parallelize the work across multiple CPU cores (or even use GPUs). I have provided an example that parallelizes the workload across multiple CPU cores in the Pro tips section at the end of the post.

Comparison

Model performance

MTEB (the Massive Text Embedding Benchmark) compares various models on tasks such as search, clustering, etc. and publishes the results. I have summarized the results for the 3 models below; for more details, please visit the MTEB leaderboard.

Model Name             | Dimension | Average accuracy on various tasks
text-embedding-ada-002 | 1536      | 60.99
gte-base               | 768       | 62.39
gte-small              | 384       | 61.36

Both the gte-base and gte-small models provide better performance on average than the Ada embedding model, while being much smaller in size.

Cosine similarity

Cosine similarity measures the similarity between two word embeddings. In simple terms, a higher score corresponds to greater similarity, while a lower score indicates dissimilarity.
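As a quick sketch (assuming the get_embedding helper defined earlier and numpy installed), cosine similarity can be computed directly from two vectors:

import numpy as np
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    # cos(a, b) = (a · b) / (||a|| * ||b||)
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Similar words should score higher than dissimilar ones
print(cosine_similarity(get_embedding(text="cat"), get_embedding(text="dog")))
print(cosine_similarity(get_embedding(text="cat"), get_embedding(text="river")))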

We can visualize the cosine similarity of the 6 sample words using a heatmap. In this heatmap, darker shades of blue represent higher scores and thus more similar words, whereas lighter shades of blue signify more distant or dissimilar words.

OpenAI embeddings

OpenAI embeddings cosine similarity

The text-embedding-ada-002 model produces embeddings where similar words (such as Apple and Orange) have a higher cosine similarity than dissimilar words (such as River and Apple).

GTE Embeddings

GTE embeddings cosine similarity

Heatmap for the gte-base model. The heatmap looks similar to that of the text-embedding-ada-002 model.

Conclusion

Both proprietary and open-source models can be used to generate embeddings. In this post, we compared two models and observed that they produce similar results. In terms of performance, the GTE family of models ranks better. The GTE models also produce vectors with much smaller dimensions, which is a significant advantage when using embeddings in production use cases, such as search and clustering.

Pro tips

Parallelizing OpenAI’s get_embeddings function

openai_embedding.py
import concurrent.futures
from typing import List

import openai
import tiktoken

class OpenAIEmbedding:
    def __init__(
        self, model_name: str = "text-embedding-ada-002", api_key: str = ""
    ) -> None:
        self.model_name = model_name
        self.encoding = tiktoken.encoding_for_model(model_name=model_name)
        self.api_key = api_key

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        print("Getting embeddings for texts...")
        return self.__get_embeddings_parallel(texts=texts)

    def __get_embeddings_for_batch(self, batch: List[str]) -> List[List[float]]:
        """Helper function to fetch embeddings for a batch."""
        response = openai.Embedding.create(
            model=self.model_name, input=batch, api_key=self.api_key
        )
        return [e["embedding"] for e in response["data"]]

    def __get_embeddings_parallel(self, texts: List[str]) -> List[List[float]]:
        BATCH_SIZE = 2000
        batches = [texts[i : i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]

        embeddings = []
        # Use ThreadPoolExecutor to fetch data for each batch in parallel
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            for batch_result in executor.map(self.__get_embeddings_for_batch, batches):
                embeddings.extend(batch_result)

        return embeddings
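
A minimal usage sketch for the class above (assuming a valid API key; the key string is a placeholder):

embedder = OpenAIEmbedding(api_key="your-api-key")
vectors = embedder.get_embeddings(texts=["cat", "dog", "horse"])
print(len(vectors), len(vectors[0]))  # 3 vectors, 1536 dimensions each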

Parallelizing GTE’s get_embeddings function

gte_embedding.py
from typing import List

from sentence_transformers import SentenceTransformer


class GTEEmbedding:
    def __init__(self, model_name: str = "thenlper/gte-base") -> None:
        self.model = SentenceTransformer(model_name)

    def _encode_batch(self, batch: List[str]) -> List[List[float]]:
        # This is a helper function to encode a batch of texts
        print(f"Encoding batch of {len(batch)} texts...")
        return self.model.encode(batch).tolist()

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        from multiprocessing import Pool, cpu_count

        batch_size = 1000
        batches = [
            texts[i : i + batch_size] for i in range(0, len(texts), batch_size)
        ]

        # Limit the workers to 5 or available CPUs, whichever is less
        max_workers = min(5, cpu_count())

        with Pool(max_workers) as pool:
            results = pool.map(self._encode_batch, batches)

        # Flatten the results
        print("Flattening results...")
        embeddings = [
            embedding for batch_result in results for embedding in batch_result
        ]
        return embeddings
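
And a similar usage sketch for the GTE class (the model weights are downloaded from the Hugging Face Hub on first use):

# On macOS and Windows, multiprocessing requires the entry-point guard below
if __name__ == "__main__":
    embedder = GTEEmbedding()
    vectors = embedder.get_embeddings(texts=["cat", "dog", "horse"])
    print(len(vectors), len(vectors[0]))  # 3 vectors, 768 dimensions each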