
Building a Semantic Search Engine with Transformers and FAISS

AI/ML · NLP · Python · Semantic Search

What You'll Build

A semantic search engine that finds research papers by meaning, not keywords. Search for "attention mechanism in neural networks" and find relevant papers even if they don't contain those exact words.

By the end of this tutorial, you will:

  • Generate 768-dimensional embeddings for 41,000 ML papers
  • Build a GPU-accelerated similarity search index with FAISS (Facebook AI Similarity Search)
  • Query papers in near real-time by semantic similarity

Prerequisites

Required:

  • Python 3.9+
  • NVIDIA GPU with CUDA 12.4+ (for GPU acceleration)
  • 8GB+ RAM
  • Git LFS installed

Knowledge assumed:

  • Basic Python and pandas
  • Familiarity with machine learning concepts

Tech Stack

Component | Purpose
PyTorch + CUDA | GPU-accelerated deep learning
MPNet | Generates 768-dimensional text embeddings
FAISS-GPU | Fast similarity search across vectors
Pandas | Dataset loading and manipulation

Step 1: Clone and Set Up Environment

git lfs install
git clone https://github.com/sheygs/semantic-search.git
cd semantic-search

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Why Git LFS? The dataset and pre-computed embeddings exceed GitHub's 100MB file limit, so Git LFS stores and downloads these large files separately. If the files in data/ come down as tiny pointer files instead of real data, run git lfs pull to fetch the actual content.

Expected result: Repository cloned with data/ folder containing research_papers.json.


Step 2: Install Dependencies

pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Or install packages individually:

pip install "sentence-transformers>=5.0.0" "faiss-gpu-cu12>=1.7.2"
pip install "pandas>=2.3.0" "scikit-learn>=1.6.0" "numpy<2.0"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Package | Purpose
sentence-transformers | Converts text to semantic embeddings
faiss-gpu-cu12 | GPU-accelerated similarity search (CUDA 12)
numpy<2.0 | Pinned to avoid FAISS compatibility issues
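
To confirm the pins resolved as expected, a quick check (a minimal snippet, not part of the tutorial code):

import numpy, torch
import faiss  # should import cleanly if the CUDA 12 wheel matches your setup

print(numpy.__version__)   # should be < 2.0
print(torch.version.cuda)  # should report a 12.x build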

Step 3: Import Libraries and Verify GPU

import pickle
import pandas as pd
import torch
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing

# Verify GPU access
print(f"GPUs available: {faiss.get_num_gpus()}")

Output:

GPUs available: 1

If GPU count is 0: FAISS falls back to CPU, which is significantly slower. Check your CUDA installation.
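
If that happens, a PyTorch-side check can tell you whether CUDA is visible at all. Note that PyTorch and FAISS ship separate CUDA bindings, so PyTorch seeing the GPU does not guarantee FAISS does:

import torch

print(torch.cuda.is_available())  # expect True
print(torch.version.cuda)         # CUDA version the wheel was built against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))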


Step 4: Load the Dataset

df = pd.read_json("data/research_papers.json")
df = df.drop(["author", "link", "tag"], axis=1)

print(f"Number of research papers: {len(df)}")
df.head()

Output:

Number of research papers: 41000
id | title | year
1802.00209v1 | Dual Recurrent Attention Units for Visual Question Answering | 2018
1603.03827v1 | Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks | 2016
1606.00776v2 | Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation | 2016
1705.08142v2 | Learning what to share between loosely related tasks | 2017
1709.02349v2 | A Deep Reinforcement Learning Chatbot | 2017

The dataset contains 41,000 ML research papers with id, title, summary, year, month, and day columns. We drop metadata columns (author, link, tag) not needed for search.
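
To confirm what remains after the drop:

print(df.columns.tolist())
# ['id', 'title', 'summary', 'year', 'month', 'day']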


Step 5: Load the Embedding Model

model = SentenceTransformer('all-mpnet-base-v2')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(f"model: {model}\n device: {device}")

Output:

model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True, ...})
  (2): Normalize()
)
 device: cuda

Model breakdown:

  • MPNet (all-mpnet-base-v2) — High-accuracy semantic embeddings. Uses a hybrid of masked and permuted language modeling for superior context awareness.
  • all-MiniLM-L6-v2 (alternative) — Optimized for high-speed inference. Maintains 95% of MPNet's performance at lower latency.
  • Mean pooling — Aggregates token embeddings into a fixed-size 768-dimensional vector that represents the sentence's semantic meaning.
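
To see mean pooling in action, here is a quick sanity sketch. It assumes sentence-transformers' output_value='token_embeddings' option, which returns one vector per token; averaging those vectors and L2-normalizing should reproduce the model's own sentence embedding:

import torch
import torch.nn.functional as F

sentence = "Attention mechanisms weigh the relevance of each token."

# One 768-dimensional vector per token, straight out of the transformer
token_embs = model.encode(sentence, output_value='token_embeddings', convert_to_tensor=True)

# Mean pooling plus normalization by hand (single sentence, so no padding to mask out)
manual = F.normalize(token_embs.mean(dim=0), dim=0)

# The model's own pipeline: transformer -> mean pooling -> normalize
auto = model.encode(sentence, convert_to_tensor=True)

print(torch.allclose(manual, auto, atol=1e-4))  # expect True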

Step 6: Generate Embeddings

Option A: Generate new embeddings (~40 seconds on GPU)

embeddings = model.encode(df.summary.to_list(), show_progress_bar=True)

# Save for future use
with open('data/new_embeddings.pickle', 'wb') as pkl:
    pickle.dump(embeddings, pkl)

print(f"Shape: {embeddings.shape}")

Output:

Shape: (41000, 768)

Option B: Load pre-computed embeddings

def load_embeddings(file_path, mode='rb'):
    with open(file_path, mode) as f:
        embeddings = pickle.load(f)
        return embeddings, len(embeddings), embeddings.shape

embeddings, length, shape = load_embeddings('data/new_embeddings.pickle')
print(f"embeddings: {embeddings[0]}\n length: {length}\n shape: {shape}")
print(f"Is instance of numpy arrays :{isinstance(embeddings, np.ndarray)}")

Output:

embeddings: [-1.09981090e-01  1.64143533e-01  6.77780509e-01 ...]
 length: 41000
 shape: (41000, 768)
Is numpy array: True

Each paper summary becomes a 768-dimensional vector. Similar concepts produce similar vectors, even with different wording. This is what makes semantic search possible — the model captures meaning, not just keywords.
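
A tiny illustration of that claim (the sentences here are made up for the example):

from sentence_transformers import util

related = model.encode(["attention mechanism in neural networks",
                        "transformers weigh input tokens by relevance"])
unrelated = model.encode(["a recipe for sourdough bread"])

print(util.cos_sim(related[0], related[1]))    # noticeably higher...
print(util.cos_sim(related[0], unrelated[0]))  # ...than this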


Step 7: Prepare Data for FAISS

label_encoder = preprocessing.LabelEncoder()
print(f"Data type before encoding: {df['id'].dtype}")

df['encoded_id'] = label_encoder.fit_transform(df['id'])
print(f"Data type after encoding: {df['encoded_id'].dtype}")
df.head()

Output:

Data type before encoding: object
Data type after encoding: int64
id | title | year | encoded_id
1802.00209v1 | Dual Recurrent Attention Units for Visual Question Answering | 2018 | 36693
1603.03827v1 | Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks | 2016 | 18198
1606.00776v2 | Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation | 2016 | 19318
1705.08142v2 | Learning what to share between loosely related tasks | 2017 | 27779
1709.02349v2 | A Deep Reinforcement Learning Chatbot | 2017 | 31468

FAISS uses integer indices internally. LabelEncoder maps string paper IDs (like 1802.00209v1) to integers (0 to 40,999).
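
The mapping is reversible, which matters later when FAISS hands back integers and we want the original IDs:

# Round-trip: encoded integer back to the original string ID
print(label_encoder.inverse_transform([36693]))
# ['1802.00209v1'] (matches the first row of the table above)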


Step 8: Build the FAISS Index

# Convert embeddings to float32 numpy array
embeddings_np = np.array(embeddings, dtype=np.float32)

def create_gpu_index(embedding_dim):
    num_gpus = faiss.get_num_gpus()
    print(f"Faiss detected {num_gpus} GPU(s)")

    if num_gpus == 0:
        print("No GPU available for Faiss, falling back to CPU index")
        index = faiss.IndexFlatL2(embedding_dim)
        return faiss.IndexIDMap(index)

    res = faiss.StandardGpuResources()
    config = faiss.GpuIndexFlatConfig()
    config.device = 0

    gpu_index = faiss.GpuIndexFlatL2(res, embedding_dim, config)
    return faiss.IndexIDMap(gpu_index)

gpu_index_map = create_gpu_index(embeddings_np.shape[1])

Output:

Faiss detected 1 GPU(s)

Add all embeddings with their encoded IDs:

gpu_index_map.add_with_ids(
    embeddings_np, df["encoded_id"][:length].values.astype("int64")
)

print(f"Number of embeddings in the Faiss index: {gpu_index_map.ntotal}")

Output:

Number of embeddings in the Faiss index: 41000

What each step does:

  1. GpuIndexFlatL2 — Creates an exact L2 distance search index on GPU 0
  2. IndexIDMap — Wraps the index to support custom IDs (encoded paper IDs)
  3. add_with_ids() — Inserts all 41,000 embeddings with their corresponding paper IDs
  4. CPU fallback — Automatically falls back to CPU if no GPU is detected
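
A quick way to confirm the index behaves as expected: a vector that is already stored should be its own nearest neighbour at (near-)zero distance.

# Sanity check: query with the first stored vector
dist, ids = gpu_index_map.search(embeddings_np[:1], k=1)
print(dist[0][0], ids[0][0])  # expect ~0.0 and df['encoded_id'].iloc[0]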

Step 9: Implement Search Helpers

Define a helper function to retrieve paper information from search results:

def id_to_info(df, I, column):
    # For each encoded ID returned by FAISS, pull the matching row's value(s)
    return [list(df[df['encoded_id'] == idx][column]) for idx in I]

This maps encoded IDs returned by FAISS back to paper metadata (titles, summaries) in the DataFrame.
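
For example, using the encoded IDs from the Step 7 table:

print(id_to_info(df, [36693, 18198], 'title'))
# [['Dual Recurrent Attention Units for Visual Question Answering'],
#  ['Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks']]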


Step 10: Test the Search

Search by natural language query:

Use the abstract from the "Attention Is All You Need" paper as a search query:

query = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data"

embed = model.encode([query])
print(f"embed shape: {embed.shape}")

Output:

embed shape: (1, 768)

Search the index for the 10 nearest neighbors:

D, I = gpu_index_map.search(embed.astype("float32"), k=10)

results = {
    'L2 distances': D.flatten().tolist(),
    'ML paper IDs': I.flatten().tolist(),
    'Titles': id_to_info(df, I.flatten(), 'title'),
    'Summaries': id_to_info(df, I.flatten(), 'summary')
}

pd.DataFrame(results).head(10)

Output:

L2 distance | Title
121.83 | Feature Representation for ICU Mortality
128.81 | A Model of the Mechanisms Underlying Exploratory Behaviour
129.73 | Gibbs Sampling in Open-Universe Stochastic Languages
130.12 | The Voynich Manuscript is Written in Natural Language: The Pahlavi Hypothesis
130.54 | Leveraging Unstructured Data to Detect Emerging Reliability Issues

Note: FAISS returns raw L2 distances, where lower values indicate closer matches. If you would rather rank by cosine similarity, normalize the embeddings with faiss.normalize_L2() and search with an inner-product index such as IndexFlatIP, as in the sketch below.
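
A minimal sketch of that cosine variant, shown with a CPU IndexFlatIP for brevity (GpuIndexFlatIP is the GPU analogue). Note that the model printout in Step 5 includes a Normalize() module, so freshly generated embeddings may already be unit-length; the explicit normalize_L2() call covers pre-computed embeddings saved without normalization:

# Cosine similarity = inner product over unit-length vectors
normed = embeddings_np.copy()
faiss.normalize_L2(normed)  # in-place L2 normalization

cos_index = faiss.IndexIDMap(faiss.IndexFlatIP(normed.shape[1]))
cos_index.add_with_ids(normed, df["encoded_id"].values.astype("int64"))

q = model.encode([query]).astype("float32")
faiss.normalize_L2(q)
scores, ids = cos_index.search(q, k=10)  # higher score = more similar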


Verification Checklist

  • faiss.get_num_gpus() returns ≥1
  • Embeddings shape is (41000, 768)
  • Index contains 41,000 vectors (gpu_index_map.ntotal == 41000)
  • Search returns papers with L2 distances

Performance Summary

Operation | Time
Embedding generation (41K papers) | ~40 seconds
Search query | ~50-100 ms
Load pre-computed embeddings | ~2 seconds

Troubleshooting

"No GPU available" (faiss.get_num_gpus() returns 0)

  • Verify CUDA installation: nvidia-smi
  • Reinstall faiss-gpu: pip uninstall faiss-gpu-cu12 && pip install faiss-gpu-cu12
  • Check PyTorch CUDA: torch.cuda.is_available()

"NumPy version incompatibility"

  • Pin NumPy: pip install "numpy<2.0"

"Out of memory" during embedding generation

  • Reduce batch size: model.encode(..., batch_size=32)
  • Use CPU fallback if GPU memory is limited

Next Steps

  • Scale up: Use IndexIVFFlat for datasets with millions of vectors (see the sketch after this list)
  • Add filtering: Combine semantic search with metadata filters (year, author)
  • Deploy: Wrap in FastAPI for a production-ready API
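
A rough sketch of the IndexIVFFlat upgrade mentioned above; the parameter values are illustrative, so tune nlist and nprobe for your dataset:

nlist = 256  # number of clusters (Voronoi cells) to partition the vectors into
quantizer = faiss.IndexFlatL2(embeddings_np.shape[1])
ivf_index = faiss.IndexIVFFlat(quantizer, embeddings_np.shape[1], nlist)

ivf_index.train(embeddings_np)  # IVF indexes must be trained before adding vectors
ivf_index.add(embeddings_np)

ivf_index.nprobe = 16  # clusters searched per query: higher = better recall, slower
D, I = ivf_index.search(embed.astype("float32"), k=10)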

Repository

Complete implementation: github.com/sheygs/semantic-search


Questions about semantic search? Feel free to reach out.