Vector Embeddings and Semantic Search: The Foundation of Modern AI

Introduction

Traditional keyword search has a fundamental limitation: it matches terms, not meaning. Search for “happy” and you won’t find documents containing “joyful,” “delighted,” or “cheerful,” despite their semantic similarity.

Vector embeddings solve this problem by representing text as points in a high-dimensional space where semantic meaning determines distance. Words, sentences, or entire documents with similar meanings cluster together, enabling truly semantic search.

This technology underpins many modern AI applications:

Search engines: Understanding query intent beyond keywords.
Recommendation systems: Finding similar content based on meaning.
LLM applications: Retrieval-Augmented Generation (RAG).
Question answering: Matching questions to relevant answers.
Duplicate detection: Finding semantically identical content.

In this post, we’ll explore how embeddings work, from classical approaches like word2vec to modern transformer-based methods, and how to implement semantic search at scale using vector-capable search engines like Elasticsearch.

What Are Embeddings?

An embedding is a learned representation that maps discrete objects (words, sentences, images) to continuous vectors in a way that captures semantic relationships.

The Core Idea

# Traditional representation (one-hot encoding)
"cat"  = [1, 0, 0, 0, ..., 0]  # 50,000 dimensions, sparse
"dog"  = [0, 1, 0, 0, ..., 0]
"car"  = [0, 0, 1, 0, ..., 0]

# Embedding representation (dense vectors)
"cat"  = [0.2, -0.4, 0.7, 0.1, ...]  # 768 dimensions, dense
"dog"  = [0.3, -0.3, 0.8, 0.0, ...]  # Similar to cat
"car"  = [0.8, 0.5, -0.2, 0.9, ...]  # Different from cat/dog

Key properties:

Dense: Most values are non-zero.
Low-dimensional: Typically 128-4096 dimensions vs. one dimension per vocabulary word (tens of thousands or more) for one-hot.
Semantic: Similar meanings → similar vectors.
Learned: Trained from data, not hand-crafted.
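
To make the contrast concrete, here is a minimal sketch in plain Python. The four-dimensional dense vectors are just the first four illustrative values from the example above, not real model output, and the `cosine` helper is defined here only for the demonstration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors: every pair of distinct words is orthogonal
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0], "car": [0, 0, 1]}

# Toy dense vectors (illustrative values, not from a trained model)
dense = {
    "cat": [0.2, -0.4, 0.7, 0.1],
    "dog": [0.3, -0.3, 0.8, 0.0],
    "car": [0.8, 0.5, -0.2, 0.9],
}

one_hot_cat_dog = cosine(one_hot["cat"], one_hot["dog"])
dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_car = cosine(dense["cat"], dense["car"])
print(one_hot_cat_dog, dense_cat_dog, dense_cat_car)
```

One-hot vectors say nothing about relatedness (every distinct pair scores 0), while the dense vectors place "cat" much closer to "dog" than to "car".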

Transforming Language into Coordinates

Word2Vec: The Breakthrough (2013)

The Distributional Hypothesis

“You shall know a word by the company it keeps” — J.R. Firth (1957)

Word2Vec, introduced by Mikolov et al. at Google, operationalized this idea: words appearing in similar contexts have similar meanings.

Two Architectures

1. CBOW (Continuous Bag of Words)

Predict the center word from surrounding context:

Context: "The cat ___ on the mat"
Target:  "sat"

Input:  [the, cat, on, the] → Average embeddings
Output: Predict "sat"

2. Skip-gram

Predict context words from the center word (the more widely used variant):

Center:  "cat"
Output:  Predict context words within a window of 2: [the, sat, on]
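
Skip-gram training data is usually produced by sliding a window over tokenized text. The following is a minimal sketch of that step; the function name `skipgram_pairs` and the window size of 2 are assumptions made for illustration, not part of the original Word2Vec code.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)

# Context words generated for the center word "cat"
cat_contexts = [ctx for center, ctx in pairs if center == "cat"]
print(cat_contexts)
```

Each pair becomes one training example: the model sees the center word and is asked to predict the context word.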

[Figure: Word2Vec architecture]

Skip-gram Implementation

import torch
import torch.nn as nn

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, target_word):
        # Get embedding for target word
        embedded = self.embedding(target_word)
        
        # Predict context words
        output = self.output(embedded)
        return output

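# Training
vocab_size = 10000
embedding_dim = 300

model = SkipGramModel(vocab_size, embedding_dim)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Toy (target, context) index pairs for illustration; real pairs come from
# sliding a context window over a tokenized corpus
training_pairs = [
    (torch.tensor([1]), torch.tensor([[0], [2], [3]])),
]

# Training loop
for target_word, context_words in training_pairs:
    optimizer.zero_grad()

    # Forward pass: logits over the vocabulary, shape [1, vocab_size]
    predictions = model(target_word)

    # Sum the loss over each context word
    loss = sum(criterion(predictions, context_word)
               for context_word in context_words)

    loss.backward()
    optimizer.step()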

Optimization: Negative Sampling

Training on all vocabulary words is expensive. Negative sampling trains on:

Positive pairs: (target, actual context).
Negative pairs: (target, random words).

def negative_sampling_loss(target_emb, positive_emb, negative_embs):
    """
    Maximize: sigmoid(target · positive)
    Minimize: sigmoid(target · negative_i)
    """
    log_sigmoid = torch.nn.functional.logsigmoid

    # Positive sample: -log σ(target · positive)
    positive_loss = -log_sigmoid(torch.dot(target_emb, positive_emb))

    # Negative samples: -Σ log σ(-target · negative_i), which equals
    # -Σ log(1 - σ(target · negative_i)) but is numerically stable
    negative_loss = -torch.sum(log_sigmoid(-torch.matmul(negative_embs, target_emb)))

    return positive_loss + negative_loss

Remarkable Properties

Word2Vec embeddings exhibit semantic and syntactic relationships:

# Vector arithmetic
king - man + woman ≈ queen
paris - france + spain ≈ madrid
walking - walk + swim ≈ swimming

# Similarity
cosine_similarity("king", "queen")   # High (~0.7)
cosine_similarity("king", "car")     # Low (~0.1)
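
The analogy arithmetic can be reproduced on a toy scale. The sketch below uses hand-built 2-D vectors (hypothetical, not trained embeddings) and a helper named `analogy`, both introduced here purely for illustration. Real Word2Vec vectors satisfy such analogies only approximately.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-built vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (hypothetical)
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Solve a - b + c ≈ ? by nearest cosine neighbor, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", vectors)
print(result)
```

Excluding the three input words from the candidates matters in practice: otherwise the nearest neighbor of the result is often one of the inputs.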

[Figure: Word2Vec embedding space visualization]

From Words to Sentences: Evolution of Embeddings

Limitation of Word2Vec

Word2Vec produces fixed embeddings: each word has one vector regardless of context.

# "bank" has same embedding in both:
"I went to the bank to withdraw money"      # financial institution
"I sat on the river bank to relax"          # riverbank

GloVe (2014): Global Vectors

Combines global matrix factorization with local context:

# Objective: embedding should reflect co-occurrence statistics
J = Σ f(X_ij) * (w_i · w_j + b_i + b_j - log(X_ij))²

where:
- X_ij = co-occurrence count of words i and j.
- w_i, w_j = word vectors; b_i, b_j = bias terms.
- f(X_ij) = weighting function that down-weights rare pairs and caps the weight of very frequent ones.
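
As a rough sketch of how one term of this objective is evaluated: the weighting below follows the form given in the GloVe paper, f(x) = (x / x_max)^α for x < x_max and 1 otherwise, with the paper's defaults x_max = 100 and α = 0.75. The function names and the toy inputs are assumptions made for illustration.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): grows with the count, capped at 1 for x >= x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one co-occurring word pair (x_ij > 0)."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(x_ij)
    return glove_weight(x_ij) * error ** 2

rare = glove_weight(1)        # a rare pair gets a small weight
capped = glove_weight(5000)   # a very frequent pair is capped at 1.0

# Toy pair whose dot product already equals log(X_ij), so the loss is zero
loss = glove_pair_loss([0.5, 0.5], [1.0, 1.0], 0.0, 0.0, x_ij=math.e)
print(rare, capped, loss)
```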

ELMo (2018): Context-Aware Embeddings

ELMo popularized contextualized embeddings, computed with bidirectional LSTMs:

# Different embeddings based on context
embedding_1 = elmo("I went to the bank to withdraw")
embedding_2 = elmo("I sat on the river bank")

# embedding_1["bank"] ≠ embedding_2["bank"]

Transformer-Based Embeddings: State of the Art

BERT Embeddings (2018+)

BERT and its variants produce contextualized embeddings through self-attention:

import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors='pt')

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    
# Different embedding strategies:
# 1. Last layer [CLS] token (sentence embedding)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [1, 768]

# 2. Mean pooling of all tokens
token_embeddings = outputs.last_hidden_state  # [1, seq_len, 768]
attention_mask = inputs['attention_mask']
mean_embedding = torch.sum(token_embeddings * attention_mask.unsqueeze(-1), dim=1)
mean_embedding = mean_embedding / torch.sum(attention_mask, dim=1, keepdim=True)

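# 3. Max pooling (mask out padding first so batched, padded inputs stay correct)
masked = token_embeddings.masked_fill(attention_mask.unsqueeze(-1) == 0, float('-inf'))
max_embedding = torch.max(masked, dim=1)[0]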
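
The role of the attention mask in mean pooling is easiest to see on a tiny example. This is a dependency-free sketch with made-up numbers; `masked_mean_pool` is a name introduced here, and the nested lists stand in for a single sequence of token vectors.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors where mask == 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]

tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

Without the mask, the padding row would drag the average far away from the real tokens.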

Sentence Transformers (2019)

BERT wasn’t optimized for semantic similarity. Sentence-BERT (SBERT) fine-tunes BERT specifically for producing semantically meaningful sentence embeddings.

Architecture:

       Input Sentences
            ↓
   ┌────────────────────┐
   │  BERT / RoBERTa    │
   │  (shared weights)  │
   └────────┬───────────┘
            ↓
      Mean Pooling
            ↓
     Normalization
            ↓
   Sentence Embedding
      (768 dimensions)

Training with Siamese Networks:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('bert-base-uncased')

# Training data: sentence pairs labeled with a similarity score in [0, 1]
train_examples = [
    InputExample(texts=["The cat sits on the mat", "A feline rests on a rug"], label=0.8),
    InputExample(texts=["I love pizza", "The weather is nice"], label=0.1),
    InputExample(texts=["He plays soccer", "She enjoys football"], label=0.7),
]

# Convert to DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Cosine similarity loss
train_loss = losses.CosineSimilarityLoss(model)

# Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100
)

Usage:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = [
    "I love machine learning",
    "I enjoy artificial intelligence",
    "The weather is sunny today"
]

embeddings = model.encode(sentences)

# Compute similarities
similarity_matrix = util.cos_sim(embeddings, embeddings)

print(similarity_matrix)
# Illustrative output (exact values depend on the model version):
# tensor([[1.00, 0.72, 0.13],   # "machine learning" most similar to "AI"
#         [0.72, 1.00, 0.15],
#         [0.13, 0.15, 1.00]])

Popular Models (2023)

Model	Dimensions	Use Case	Performance
all-MiniLM-L6-v2	384	Fast, general purpose	Good
all-mpnet-base-v2	768	Better quality	Very good
instructor-xl	768	Task-specific instructions	Excellent
E5-large	1024	State-of-the-art	Excellent
OpenAI text-embedding-ada-002	1536	Commercial API	Excellent

2026 Update: State-of-the-Art Models

The embedding landscape has evolved significantly. Modern models now offer better quality, longer context windows, and multilingual capabilities:

Model	Dimensions	Context Length	Key Features	Performance
OpenAI text-embedding-3-large	3072 (configurable)	8191 tokens	Best quality, configurable dimensions	Excellent
OpenAI text-embedding-3-small	1536 (configurable)	8191 tokens	Cost-effective, fast	Very good
Cohere embed-v3	1024	512 tokens	Multilingual (100+ languages), compression	Excellent
Voyage-large-2	1536	16000 tokens	Long context, domain-specific variants	Excellent
jina-embeddings-v2-base	768	8192 tokens	Open-source, long context	Very good
NV-Embed-v2	4096	32768 tokens	NVIDIA, longest context	Excellent
bge-m3	1024	8192 tokens	Multi-granularity (dense + sparse + colbert)	Excellent
Alibaba GTE-large	1024	512 tokens	Strong on retrieval benchmarks	Very good

Key trends in 2026:

Longer context: Many models now handle 8K-32K tokens, vs 512 for most open-source models in 2023.
Configurable dimensions: Trade-off between quality and storage (e.g., OpenAI’s matryoshka embeddings).
Multi-vector representations: Hybrid dense + sparse + late interaction (e.g., BGE-M3).
Domain specialization: Finance, medical, legal-specific models.
Better multilingual: True cross-lingual semantic search across 100+ languages.
Efficiency improvements: 2-3x faster inference with quantization and distillation.
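
Configurable dimensions typically work by keeping only the leading components of the full vector and re-normalizing. A minimal sketch, assuming a model trained with a Matryoshka-style objective (for other models, truncation generally hurts quality); `truncate_embedding` and the toy vector are illustrative, not any vendor's API.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]           # toy unit-length "embedding"
short = truncate_embedding(full, 2)   # half the storage per vector
short_norm = math.sqrt(sum(x * x for x in short))
print(short, short_norm)
```

Re-normalizing after truncation keeps cosine and dot-product scores comparable across the shortened vectors.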

Migration considerations:

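Re-embedding existing corpora with newer models can improve retrieval quality (10-20% is a rough figure; measure on your own queries before migrating).
Backward compatibility: Embeddings from different models are not comparable, so a new model needs its own vector field or index and a full re-embedding of the corpus; old and new indexes can run side by side during the switch.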
Cost: OpenAI text-embedding-3-small is 5x cheaper than ada-002 with similar quality.

Similarity Metrics: Measuring Distance

1. Cosine Similarity

Most common for text embeddings:

import numpy as np

def cosine_similarity(a, b):
    """
    cos(θ) = (a · b) / (||a|| * ||b||)
    Range: [-1, 1]
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
vec_a = np.array([1, 2, 3])
vec_b = np.array([2, 4, 6])  # Same direction

similarity = cosine_similarity(vec_a, vec_b)
# Output: ≈ 1.0 (identical direction, up to floating-point rounding)

[Figure: Cosine similarity visualization]

Why cosine?

Magnitude-invariant (focuses on direction).
Range [-1, 1] easy to interpret.
Efficient computation (a single dot product once vectors are normalized).

2. Euclidean Distance

Geometric distance in vector space:

def euclidean_distance(a, b):
    """
    d = √(Σ(a_i - b_i)²)
    Range: [0, ∞)
    """
    return np.sqrt(np.sum((a - b) ** 2))

# Or use numpy
distance = np.linalg.norm(a - b)

When to use:

Normalized embeddings (the ranking then matches cosine similarity).
Clustering algorithms.
When magnitude matters.

3. Dot Product

Raw similarity without normalization:

def dot_product(a, b):
    """
    a · b = Σ(a_i * b_i)
    Range: (-∞, ∞)
    """
    return np.dot(a, b)

Advantages:

Fastest computation.
No square root or division.
Good for normalized vectors, where it equals cosine similarity.
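
These three metrics are closely related once vectors are scaled to unit length: the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2·cos(θ). The sketch below checks both identities numerically in plain Python; the helper names are defined here for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [0.5, 0.8, 0.3]
b = [0.6, 0.7, 0.4]
a_hat, b_hat = normalize(a), normalize(b)

cos_ab = cosine(a, b)
dot_unit = dot(a_hat, b_hat)                                     # equals cos_ab
sq_dist_unit = sum((x - y) ** 2 for x, y in zip(a_hat, b_hat))   # equals 2 - 2*cos_ab
cos_scaled = cosine([10 * x for x in a], b)                      # scaling leaves cosine unchanged
print(cos_ab, dot_unit, sq_dist_unit, cos_scaled)
```

This is why many systems normalize embeddings once at index time and then use the cheaper dot product: all three metrics produce the same ranking.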

4. Manhattan Distance (L1)

Sum of absolute differences:

def manhattan_distance(a, b):
    """
    d = Σ|a_i - b_i|
    """
    return np.sum(np.abs(a - b))

Comparison Example

import numpy as np

# Two embeddings
vec1 = np.array([0.5, 0.8, 0.3])
vec2 = np.array([0.6, 0.7, 0.4])

# Different metrics
cosine = cosine_similarity(vec1, vec2)        # 0.985 (very similar)
euclidean = euclidean_distance(vec1, vec2)    # 0.173 (close)
dot = dot_product(vec1, vec2)                 # 0.98
manhattan = manhattan_distance(vec1, vec2)    # 0.30

Vector Databases: Searching at Scale

The Challenge

With millions of vectors, brute-force search is impractical:

# Brute force: O(n * d) where n = vectors, d = dimensions
def brute_force_search(query, corpus, k=10):
    similarities = []
    for doc_embedding in corpus:  # n iterations
        sim = cosine_similarity(query, doc_embedding)  # d operations
        similarities.append(sim)
    
    # Return top-k indices, most similar first
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return top_k_indices

For 1 million 768-dimensional vectors: ~768 million operations per query ❌

Approximate Nearest Neighbor (ANN) Algorithms

Trade perfect accuracy for massive speed improvements.

1. HNSW (Hierarchical Navigable Small World)

Most popular algorithm (used by Elasticsearch, Weaviate, and Qdrant, among others):

Level 2: • ─────────── •
         │             │
Level 1: • ──── • ──── • ──── •
         │      │      │      │
Level 0: •─•─•─•─•─•─•─•─•─•─•  (all vectors)

How it works:

Build hierarchical graph of vectors.
Search starts at top level (sparse, long jumps).
Descend levels, refining search.
Bottom level has all vectors.

# Simplified HNSW concept
class HNSW:
    def search(self, query, k=10):
        # Start at top level
        current_node = self.entry_point
        
        for level in range(self.max_level, -1, -1):
            # Greedy search in current level
            while True:
                neighbors = self.get_neighbors(current_node, level)
                closest = min(neighbors, key=lambda n: distance(query, n))
                
                if distance(query, closest) >= distance(query, current_node):
                    break
                current_node = closest
        
        # Refined search at level 0
        return self.get_k_nearest(current_node, query, k)

Performance:

Build time: roughly O(n log n).
Query time: roughly O(log n).
Recall: typically ~95-99% of exact results, depending on index parameters.
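
Recall for an ANN index is usually measured by comparing its results against exact brute-force results for the same queries. A minimal sketch; the document ids below are hypothetical and `recall_at_k` is a helper defined here.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbors that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)

exact_top5 = [12, 7, 33, 4, 91]    # hypothetical ids from brute-force search
approx_top5 = [12, 33, 7, 91, 58]  # hypothetical ids from an ANN index
recall = recall_at_k(approx_top5, exact_top5)
print(recall)
```

Averaging this value over a sample of real queries gives a concrete number to track while tuning index parameters.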

2. IVF (Inverted File Index)

Partition space into clusters:

# IVF concept
import numpy as np
from sklearn.cluster import KMeans

class IVF:
    def __init__(self, n_clusters=100):
        self.n_clusters = n_clusters
        self.kmeans = KMeans(n_clusters=n_clusters)

    def build_index(self, vectors):
        # Cluster all vectors; fit() returns the estimator itself, so the
        # centroids are read from cluster_centers_
        vectors = np.asarray(vectors)
        self.kmeans.fit(vectors)
        self.cluster_centers = self.kmeans.cluster_centers_

        # Assign each vector to its cluster (labels_ holds the assignments)
        self.inverted_lists = [[] for _ in range(self.n_clusters)]
        for i, (vec, cluster_id) in enumerate(zip(vectors, self.kmeans.labels_)):
            self.inverted_lists[cluster_id].append((i, vec))

    def search(self, query, k=10, n_probe=5):
        # Find the n_probe clusters whose centroids are nearest to the query
        center_distances = np.linalg.norm(self.cluster_centers - query, axis=1)
        nearest_clusters = np.argsort(center_distances)[:n_probe]

        # Search only in nearest clusters
        candidates = []
        for cluster_id in nearest_clusters:
            candidates.extend(self.inverted_lists[cluster_id])

        # Find top-k in candidates
        similarities = [(i, cosine_similarity(query, vec))
                        for i, vec in candidates]
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]

3. LSH (Locality-Sensitive Hashing)

Hash similar vectors to same buckets:

# LSH concept (random-hyperplane hashing for cosine similarity)
import numpy as np

class LSH:
    def __init__(self, embedding_dim, n_tables=10, n_bits=8):
        # One set of random hyperplanes per hash table; each hyperplane
        # contributes one bit of the bucket key
        self.hyperplanes = [
            np.random.randn(n_bits, embedding_dim)
            for _ in range(n_tables)
        ]
        self.hash_tables = [{} for _ in range(n_tables)]
        self.vectors = {}  # vec_id -> vector, kept for exact re-scoring

    def hash_vector(self, vec, hyperplanes):
        # Record which side of each hyperplane the vector falls on
        return tuple((hyperplanes @ vec > 0).astype(int))

    def insert(self, vec_id, vec):
        self.vectors[vec_id] = vec
        for table, hyperplanes in zip(self.hash_tables, self.hyperplanes):
            key = self.hash_vector(vec, hyperplanes)
            table.setdefault(key, []).append(vec_id)

    def search(self, query, k=10):
        # Collect candidate ids from the matching bucket of every table
        candidates = set()
        for table, hyperplanes in zip(self.hash_tables, self.hyperplanes):
            key = self.hash_vector(query, hyperplanes)
            candidates.update(table.get(key, []))

        # Compute exact similarities for candidates
        similarities = [(vec_id, cosine_similarity(query, self.vectors[vec_id]))
                        for vec_id in candidates]
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]

Elasticsearch for Vector Search

Elasticsearch 8.0+ has native approximate kNN vector search. The knn option of the search API, used in the queries in this section, requires 8.4 or later.

Index Setup

PUT /my-embeddings
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

[Figure: Elasticsearch vector search architecture]

Indexing Documents

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch(['http://localhost:9200'])
model = SentenceTransformer('all-MiniLM-L6-v2')

# Index documents with embeddings
documents = [
    {"title": "Machine Learning Basics", "content": "ML is a subset of AI..."},
    {"title": "Deep Learning Guide", "content": "Neural networks are..."},
    {"title": "NLP Tutorial", "content": "Natural language processing..."}
]

for i, doc in enumerate(documents):
    # Generate embedding
    embedding = model.encode(doc['content']).tolist()
    
    # Index document
    es.index(
        index='my-embeddings',
        id=i,
        document={
            'title': doc['title'],
            'content': doc['content'],
            'embedding': embedding
        }
    )

Semantic Search Query

def semantic_search(query_text, k=10):
    # Generate query embedding
    query_embedding = model.encode(query_text).tolist()
    
    # kNN search
    response = es.search(
        index='my-embeddings',
        knn={
            'field': 'embedding',
            'query_vector': query_embedding,
            'k': k,
            'num_candidates': 100  # Number of candidates to consider
        }
    )
    
    return [hit['_source'] for hit in response['hits']['hits']]

# Query
results = semantic_search("How do neural networks work?")
for doc in results:
    print(f"{doc['title']}: {doc['content'][:100]}...")

Hybrid Search: Combining Dense and Sparse

Best results often come from combining semantic (dense) and keyword (sparse) search:

def hybrid_search(query_text, k=10, dense_weight=0.7):
    query_embedding = model.encode(query_text).tolist()

    # Elasticsearch scores each hit as the boosted sum of its kNN and BM25 scores
    response = es.search(
        index='my-embeddings',
        # Semantic search (dense vectors)
        knn={
            'field': 'embedding',
            'query_vector': query_embedding,
            'k': k,
            'num_candidates': 100,
            'boost': dense_weight
        },
        # Keyword search (BM25)
        query={
            'multi_match': {
                'query': query_text,
                'fields': ['title^2', 'content'],
                'boost': 1 - dense_weight
            }
        },
        size=k
    )

    return response['hits']['hits']
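
BM25 scores and cosine scores live on different scales, so a weighted sum of raw scores can be dominated by one side. One common workaround is to normalize each result list and fuse the scores client-side. The sketch below shows min-max normalization followed by a weighted sum; the function names and scores are hypothetical, and this is one possible fusion strategy rather than what Elasticsearch does internally.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(dense_scores, sparse_scores, dense_weight=0.7):
    """Weighted sum of normalized scores; a missing score counts as 0."""
    dense, sparse = min_max(dense_scores), min_max(sparse_scores)
    docs = set(dense) | set(sparse)
    fused = {
        doc: dense_weight * dense.get(doc, 0.0)
             + (1 - dense_weight) * sparse.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

dense_scores = {"a": 0.82, "b": 0.75, "c": 0.40}   # hypothetical cosine scores
sparse_scores = {"b": 12.1, "c": 3.4, "d": 9.0}    # hypothetical BM25 scores
ranking = fuse(dense_scores, sparse_scores)
print([doc for doc, _ in ranking])
```

Document "b" ranks first because it scores well in both lists, which is the behavior hybrid search is meant to reward.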

Performance Tuning

# Index settings and HNSW options
# m: number of connections per node
# ef_construction: candidate-list size at build time (higher = better recall, slower indexing)
PUT /my-embeddings
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}

Dense vs Sparse Vectors

Dense Vectors

Characteristics:

Most values are non-zero.
Learned from neural networks.
Capture semantic meaning.
Typical dimensions: 128-4096.

Example:

dense = [0.23, -0.45, 0.12, 0.89, -0.34, ...]  # 384 dimensions

Advantages:

Capture semantic similarity.
Generalize to synonyms and paraphrases.
Work across languages (multilingual models).

Disadvantages:

Computationally expensive.
Black box (hard to interpret).
May miss exact keyword matches.

Sparse Vectors

Characteristics:

Most values are zero.
Explicit features (e.g., TF-IDF, BM25).
Typically very high dimensional.
Interpretable.

Example:

# TF-IDF vector (vocab size = 50,000)
sparse = {
    4523: 0.8,   # "machine"
    8901: 0.6,   # "learning"
    15234: 0.4   # "algorithm"
}  # Only 3 non-zero values out of 50,000
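
Because only the non-zero entries are stored, comparing two sparse vectors only requires visiting the dimensions they share. A minimal sketch using the dictionary representation above; `sparse_dot` and the query vector (including dimension id 20011) are made up for illustration.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

doc = {4523: 0.8, 8901: 0.6, 15234: 0.4}   # "machine", "learning", "algorithm"
query = {8901: 0.5, 20011: 0.9}            # shares only dimension 8901 with doc
score = sparse_dot(doc, query)
print(score)
```

The cost depends on the number of non-zero entries rather than the vocabulary size, which is what makes inverted-index retrieval fast.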

Advantages:

Fast exact keyword matching.
Interpretable (see which words match).
Good for domain-specific terms.

Disadvantages:

No semantic understanding.
Vocabulary mismatch problem.
Language-specific.

SPLADE: Best of Both Worlds

SPLADE learns sparse vectors using neural networks:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained('naver/splade-cocondenser-ensembledistil')
tokenizer = AutoTokenizer.from_pretrained('naver/splade-cocondenser-ensembledistil')

def compute_splade_vector(text):
    inputs = tokenizer(text, return_tensors='pt')
    
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # ReLU + log(1 + x) activation
    weights = torch.nn.functional.relu(logits)
    weights = torch.log(1 + weights)
    
    # Max pooling over tokens
    sparse_vector = torch.max(weights, dim=1).values.squeeze()
    
    # Keep only top-k dimensions
    top_k = torch.topk(sparse_vector, k=100)
    
    return {int(idx): float(val) for idx, val in zip(top_k.indices, top_k.values)}

Practical Applications

1. Semantic Search Engine

import torch
from sentence_transformers import SentenceTransformer, util

class SemanticSearchEngine:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
    
    def index_documents(self, documents):
        """Index a corpus of documents"""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            show_progress_bar=True
        )
    
    def search(self, query, top_k=5):
        """Search for similar documents"""
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        
        # Compute cosine similarities
        similarities = util.cos_sim(query_embedding, self.embeddings)[0]
        
        # Get top-k results
        top_results = torch.topk(similarities, k=min(top_k, len(self.documents)))
        
        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                'document': self.documents[int(idx)],
                'score': float(score),
                'index': int(idx)
            })
        
        return results

# Usage
engine = SemanticSearchEngine()
engine.index_documents([
    "Python is a programming language",
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "Natural language processing analyzes text"
])

results = engine.search("What is AI?")
for result in results:
    print(f"Score: {result['score']:.3f} | {result['document']}")

2. Duplicate Detection

def find_duplicates(texts, threshold=0.85):
    """Find semantically similar/duplicate texts"""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(texts)
    
    duplicates = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            similarity = cosine_similarity(embeddings[i], embeddings[j])
            if similarity > threshold:
                duplicates.append({
                    'text1': texts[i],
                    'text2': texts[j],
                    'similarity': similarity
                })
    
    return duplicates

# Example
texts = [
    "The cat sat on the mat",
    "A feline rested on the rug",  # Similar
    "I love pizza",
    "The weather is nice"
]

duplicates = find_duplicates(texts)
for dup in duplicates:
    print(f"Similarity: {dup['similarity']:.3f}")
    print(f"  1: {dup['text1']}")
    print(f"  2: {dup['text2']}\n")

3. Recommendation System

class ContentRecommender:
    def __init__(self, items, descriptions):
        self.items = items
        self.descriptions = descriptions
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = self.model.encode(descriptions)
    
    def recommend_similar(self, item_index, top_k=5):
        """Recommend items similar to given item"""
        item_embedding = self.embeddings[item_index]
        
        similarities = [
            cosine_similarity(item_embedding, emb)
            for emb in self.embeddings
        ]
        
        # Exclude the item itself
        similarities[item_index] = -1
        
        # Get top-k
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        recommendations = [
            {
                'item': self.items[idx],
                'description': self.descriptions[idx],
                'similarity': similarities[idx]
            }
            for idx in top_indices
        ]
        
        return recommendations

# Usage
items = ["Product A", "Product B", "Product C", "Product D"]
descriptions = [
    "Wireless headphones with noise cancellation",
    "Bluetooth earbuds for sports",
    "Over-ear studio headphones",
    "Gaming keyboard with RGB lighting"
]

recommender = ContentRecommender(items, descriptions)
recommendations = recommender.recommend_similar(0, top_k=2)

print("Similar to 'Wireless headphones':")
for rec in recommendations:
    print(f"  {rec['item']} (similarity: {rec['similarity']:.3f})")

4. Clustering and Topic Discovery

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def cluster_documents(texts, n_clusters=3):
    """Cluster documents by semantic similarity"""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(texts)
    
    # K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    
    # Organize by cluster
    clustered_texts = {i: [] for i in range(n_clusters)}
    for text, cluster in zip(texts, clusters):
        clustered_texts[cluster].append(text)
    
    return clustered_texts, embeddings, kmeans

# Visualize with t-SNE
from sklearn.manifold import TSNE

def visualize_clusters(embeddings, clusters):
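    # Reduce to 2D (perplexity must be smaller than the number of samples)
    perplexity = min(30, len(embeddings) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings_2d = tsne.fit_transform(embeddings)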
    
    # Plot
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(
        embeddings_2d[:, 0],
        embeddings_2d[:, 1],
        c=clusters,
        cmap='viridis',
        s=50
    )
    plt.colorbar(scatter)
    plt.title("Document Clusters (t-SNE)")
    plt.show()

Production Considerations

1. Model Selection

Tradeoffs:

Quality vs Speed: Larger models (768d) vs smaller (384d).
Domain specificity: General vs domain-trained models.
Cost: Self-hosted vs API (OpenAI, Cohere).

2. Caching Embeddings

import hashlib
import pickle
from pathlib import Path

class EmbeddingCache:
    def __init__(self, cache_dir='./embedding_cache'):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get_cache_path(self, text):
        # Hash text for filename. The key does not include the model name,
        # so use a separate cache_dir per embedding model.
        hash_key = hashlib.md5(text.encode()).hexdigest()
        return self.cache_dir / f"{hash_key}.pkl"
    
    def get(self, text):
        cache_path = self.get_cache_path(text)
        if cache_path.exists():
            with open(cache_path, 'rb') as f:
                return pickle.load(f)
        return None
    
    def set(self, text, embedding):
        cache_path = self.get_cache_path(text)
        with open(cache_path, 'wb') as f:
            pickle.dump(embedding, f)
    
    def get_or_compute(self, text, model):
        embedding = self.get(text)
        if embedding is None:
            embedding = model.encode(text)
            self.set(text, embedding)
        return embedding

3. Batch Processing

def batch_embed_documents(documents, model, batch_size=32):
    """Efficiently embed large document collections"""
    embeddings = []
    
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        batch_embeddings = model.encode(
            batch,
            batch_size=batch_size,
            show_progress_bar=False,  # avoid printing one progress bar per batch
            convert_to_numpy=True
        )
        embeddings.append(batch_embeddings)
    
    return np.vstack(embeddings)

4. Monitoring and Evaluation

class SearchQualityMonitor:
    def __init__(self):
        self.queries = []
        self.results = []
        self.relevance_feedback = []
    
    def log_query(self, query, results, clicked_index=None):
        self.queries.append(query)
        self.results.append(results)
        self.relevance_feedback.append(clicked_index)
    
    def compute_mrr(self):
        """Mean Reciprocal Rank"""
        reciprocal_ranks = []
        for clicked_idx in self.relevance_feedback:
            if clicked_idx is not None:
                reciprocal_ranks.append(1.0 / (clicked_idx + 1))
            else:
                reciprocal_ranks.append(0.0)
        return np.mean(reciprocal_ranks) if reciprocal_ranks else 0.0
    
    def compute_click_through_rate(self):
        """Percentage of queries with clicks"""
        clicks = sum(1 for idx in self.relevance_feedback if idx is not None)
        return clicks / len(self.relevance_feedback) if self.relevance_feedback else 0.0

Challenges and Limitations

1. Out-of-Domain Performance

Embeddings trained on general text may not work well for specialized domains:

# General model
general_model = SentenceTransformer('all-MiniLM-L6-v2')

# May struggle with:
# - Medical terminology
# - Legal documents
# - Code snippets
# - Domain-specific jargon

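# Solution: Fine-tune on domain data
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

train_examples = [
    InputExample(texts=['myocardial infarction', 'heart attack'], label=0.9),
    InputExample(texts=['hypertension', 'high blood pressure'], label=0.9),
    # ... domain-specific pairs
]

# Fine-tune
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(general_model)
general_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)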

2. Multilingual Challenges

# English query on Spanish documents may fail
query_en = "machine learning"
doc_es = "aprendizaje automático"

# Solution: Use multilingual models
multilingual_model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

emb_en = multilingual_model.encode(query_en)
emb_es = multilingual_model.encode(doc_es)

similarity = cosine_similarity(emb_en, emb_es)  # Should be high

3. Computational Cost

# Embedding 1M documents with 768-dim model
documents = ["..." for _ in range(1_000_000)]

# Default settings on CPU: on the order of 10 hours (rough, hardware-dependent) ❌
embeddings = model.encode(documents)

# Batched on GPU: on the order of 30 minutes (rough, hardware-dependent) ✅
embeddings = model.encode(
    documents,
    batch_size=128,
    show_progress_bar=True,
    device='cuda'
)

4. Long Document Handling

Many models have token limits (often 512 tokens, though newer models accept 8K or more):

def embed_long_document(document, model, max_length=512):
    """Strategy 1: Truncate"""
    # encode() takes raw text and truncates it to model.max_seq_length tokens
    model.max_seq_length = max_length
    return model.encode(document)

def embed_long_document_chunks(document, model, chunk_size=512):
    """Strategy 2: Chunk and average"""
    # Note: chunk_size counts characters here, not tokens
    chunks = [document[i:i+chunk_size] for i in range(0, len(document), chunk_size)]
    chunk_embeddings = model.encode(chunks)
    return np.mean(chunk_embeddings, axis=0)

def embed_long_document_hierarchical(document, model):
    """Strategy 3: Hierarchical (chunk → summarize → embed)"""
    # split_into_chunks and summarize are placeholders for your own helpers
    chunks = split_into_chunks(document)
    summaries = [summarize(chunk) for chunk in chunks]
    return model.encode(" ".join(summaries))
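
Fixed-size chunks can cut a sentence or idea in half, so chunking is often done with some overlap between neighbors. Here is a minimal sketch that splits on whitespace; `chunk_words` and its default sizes are assumptions made for illustration, and word counts only approximate token counts.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=2)
print(chunks)
```

Each chunk can then be embedded and indexed separately, which usually retrieves better than a single averaged vector for the whole document.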

Future Directions

1. Late Interaction Models (ColBERT)

Instead of a single vector per document, use one vector per token:

Query:     [q1, q2, q3]        (token-level vectors)
Document:  [d1, d2, d3, d4, d5] (token-level vectors)

Score = Σ_i max_j (q_i · d_j)   (for each query token, take its best-matching document token, then sum)
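
The late-interaction score (often called MaxSim) takes, for each query token, the best match among the document tokens and sums those maxima. A small sketch in plain Python; the token vectors are toy values and `maxsim_score` is a name chosen here.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product with any document token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query_vecs = [[1.0, 0.0], [0.0, 1.0]]             # toy token vectors
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
score = maxsim_score(query_vecs, doc_vecs)
print(score)
```

The first query token matches the first document token best (0.9) and the second matches the second (0.8), so the score is their sum.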

2. Learned Sparse Retrieval

Neural models that output sparse vectors (SPLADE, SPARTA).

3. Cross-Encoder Re-ranking

Two-stage retrieval:

Fast approximate search (bi-encoder)
Accurate re-ranking (cross-encoder)

from sentence_transformers import CrossEncoder

# Stage 1: Retrieve candidates (fast_vector_search is a placeholder for your ANN query)
candidates = fast_vector_search(query, top_k=100)

# Stage 2: Re-rank with cross-encoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc) for doc in candidates])

# Return top-k after re-ranking
top_k = np.argsort(scores)[-10:][::-1]

Conclusion

Vector embeddings have revolutionized how we represent and search text. From the breakthrough of Word2Vec to modern transformer-based models, embeddings enable machines to understand semantic meaning rather than just matching keywords.

Key takeaways:

Embeddings map text to vectors where semantic similarity → vector proximity.
Evolution: Word2Vec → GloVe → ELMo → BERT → Sentence Transformers.
Similarity metrics: Cosine similarity most common for text.
ANN algorithms: HNSW, IVF enable fast search at scale.
Elasticsearch 8.0+: Native vector search with kNN.
Hybrid search: Combine dense (semantic) + sparse (keyword) for best results.
Production considerations: Caching, batching, monitoring essential.

Vector embeddings are the foundation for modern AI applications: from RAG systems that ground LLMs in factual data, to recommendation engines that understand user preferences, to semantic search that captures intent beyond keywords.

In our next post, we’ll explore Prompt Engineering, examining how to effectively communicate with large language models to get optimal results.

References

Mikolov et al. (2013): “Efficient Estimation of Word Representations in Vector Space” (Word2Vec).
Pennington et al. (2014): “GloVe: Global Vectors for Word Representation”.
Peters et al. (2018): “Deep contextualized word representations” (ELMo).
Reimers & Gurevych (2019): “Sentence-BERT”.
Johnson et al. (2019): “Billion-scale similarity search with GPUs” (FAISS).
Formal et al. (2021): “SPLADE”.
Elasticsearch Vector Search Documentation.

About this series: This is the second post in a series exploring AI Engineering. Previously, we covered the fundamentals of Large Language Models. Next up: Prompt Engineering techniques.