
Building a Semantic Search Engine with Transformers and FAISS

AI/ML · NLP · Python · Semantic Search

What You'll Build

A semantic search engine that finds research papers by meaning, not keywords. Search for "attention mechanism in neural networks" and find relevant papers even if they don't contain those exact words.

By the end of this tutorial, you will:

  • Generate 768-dimensional embeddings for 41,000 ML papers
  • Build a GPU-accelerated similarity search index with FAISS (Facebook AI Similarity Search)
  • Query papers in near real-time by semantic similarity

Prerequisites

Required:

  • Python 3.9+
  • NVIDIA GPU with CUDA 12.4+ (for GPU acceleration)
  • 8GB+ RAM
  • Git LFS installed

Knowledge assumed:

  • Basic Python and pandas
  • Familiarity with machine learning concepts

Tech Stack

Component | Purpose
PyTorch + CUDA | GPU-accelerated deep learning
MPNet | Generates 768-dimensional text embeddings
FAISS-GPU | Fast similarity search across vectors
Pandas | Dataset loading and manipulation

Step 1: Clone and Set Up Environment

git lfs install
git clone https://github.com/sheygs/semantic-search.git
cd semantic-search

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Why Git LFS? The dataset and pre-computed embeddings exceed GitHub's 100MB file limit, so Git LFS stores and downloads these large files separately. If the files in data/ come down as tiny pointer files instead of real data, run git lfs pull to fetch the actual content.

Expected result: Repository cloned with data/ folder containing research_papers.json.


Step 2: Install Dependencies

pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Or install packages individually:

pip install "sentence-transformers>=5.0.0" "faiss-gpu-cu12>=1.7.2"
pip install "pandas>=2.3.0" "scikit-learn>=1.6.0" "numpy<2.0"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Package | Purpose
sentence-transformers | Converts text to semantic embeddings
faiss-gpu-cu12 | GPU-accelerated similarity search (CUDA 12)
numpy<2.0 | Pinned to avoid FAISS compatibility issues
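
To confirm the pins resolved as expected, a quick check (a minimal snippet, not part of the tutorial code):

import numpy, torch
import faiss  # should import cleanly if the CUDA 12 wheel matches your setup

print(numpy.__version__)   # should be < 2.0
print(torch.version.cuda)  # should report a 12.x build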

Step 3: Import Libraries and Verify GPU

import pickle
import pandas as pd
import torch
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing

# Verify GPU access
print(f"GPUs available: {faiss.get_num_gpus()}")

Output:

GPUs available: 1

If GPU count is 0: FAISS falls back to CPU, which is significantly slower. Check your CUDA installation.
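
If that happens, a PyTorch-side check can tell you whether CUDA is visible at all. Note that PyTorch and FAISS ship separate CUDA bindings, so PyTorch seeing the GPU does not guarantee FAISS does:

import torch

print(torch.cuda.is_available())  # expect True
print(torch.version.cuda)         # CUDA version the wheel was built against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))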


Step 4: Load the Dataset

df = pd.read_json("data/research_papers.json")
df = df.drop(["author", "link", "tag"], axis=1)

print(f"Number of research papers: {len(df)}")
df.head()

Output:

Number of research papers: 41000
id | title | year
1802.00209v1 | Dual Recurrent Attention Units for Visual Question Answering | 2018
1603.03827v1 | Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks | 2016
1606.00776v2 | Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation | 2016
1705.08142v2 | Learning what to share between loosely related tasks | 2017
1709.02349v2 | A Deep Reinforcement Learning Chatbot | 2017

The dataset contains 41,000 ML research papers with id, title, summary, year, month, and day columns. We drop metadata columns (author, link, tag) not needed for search.
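
To confirm what remains after the drop:

print(df.columns.tolist())
# ['id', 'title', 'summary', 'year', 'month', 'day']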


Step 5: Load the Embedding Model

model = SentenceTransformer('all-mpnet-base-v2')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(f"model: {model}\n device: {device}")

Output:

model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True, ...})
  (2): Normalize()
)
 device: cuda

Model breakdown:

  • MPNet (all-mpnet-base-v2) — High-accuracy semantic embeddings. Uses a hybrid of masked and permuted language modeling for superior context awareness.
  • all-MiniLM-L6-v2 (alternative) — Optimized for high-speed inference. Maintains 95% of MPNet's performance at lower latency.
  • Mean pooling — Aggregates token embeddings into a fixed-size 768-dimensional vector that represents the sentence's semantic meaning.
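
To see mean pooling in action, here is a quick sanity sketch. It assumes sentence-transformers' output_value='token_embeddings' option, which returns one vector per token; averaging those vectors and L2-normalizing should reproduce the model's own sentence embedding:

import torch
import torch.nn.functional as F

sentence = "Attention mechanisms weigh the relevance of each token."

# One 768-dimensional vector per token, straight out of the transformer
token_embs = model.encode(sentence, output_value='token_embeddings', convert_to_tensor=True)

# Mean pooling plus normalization by hand (single sentence, so no padding to mask out)
manual = F.normalize(token_embs.mean(dim=0), dim=0)

# The model's own pipeline: transformer -> mean pooling -> normalize
auto = model.encode(sentence, convert_to_tensor=True)

print(torch.allclose(manual, auto, atol=1e-4))  # expect True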

Step 6: Generate Embeddings

Option A: Generate new embeddings (~40 seconds on GPU)

embeddings = model.encode(df.summary.to_list(), show_progress_bar=True)

# Save for future use
with open('data/new_embeddings.pickle', 'wb') as pkl:
    pickle.dump(embeddings, pkl)

print(f"Shape: {embeddings.shape}")

Output:

Shape: (41000, 768)

Option B: Load pre-computed embeddings

def load_embeddings(file_path, mode='rb'):
    with open(file_path, mode) as f:
        embeddings = pickle.load(f)
        return embeddings, len(embeddings), embeddings.shape

embeddings, length, shape = load_embeddings('data/new_embeddings.pickle')
print(f"embeddings: {embeddings[0]}\n length: {length}\n shape: {shape}")
print(f"Is instance of numpy arrays :{isinstance(embeddings, np.ndarray)}")

Output:

embeddings: [-1.09981090e-01  1.64143533e-01  6.77780509e-01 ...]
 length: 41000
 shape: (41000, 768)
Is numpy array: True

Each paper summary becomes a 768-dimensional vector. Similar concepts produce similar vectors, even with different wording. This is what makes semantic search possible — the model captures meaning, not just keywords.
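
A tiny illustration of that claim (the sentences here are made up for the example):

from sentence_transformers import util

related = model.encode(["attention mechanism in neural networks",
                        "transformers weigh input tokens by relevance"])
unrelated = model.encode(["a recipe for sourdough bread"])

print(util.cos_sim(related[0], related[1]))    # noticeably higher...
print(util.cos_sim(related[0], unrelated[0]))  # ...than this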


Step 7: Prepare Data for FAISS

label_encoder = preprocessing.LabelEncoder()
print(f"Data type before encoding: {df['id'].dtype}")

df['encoded_id'] = label_encoder.fit_transform(df['id'])
print(f"Data type after encoding: {df['encoded_id'].dtype}")
df.head()

Output:

Data type before encoding: object
Data type after encoding: int64
id | title | year | encoded_id
1802.00209v1 | Dual Recurrent Attention Units for Visual Question Answering | 2018 | 36693
1603.03827v1 | Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks | 2016 | 18198
1606.00776v2 | Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation | 2016 | 19318
1705.08142v2 | Learning what to share between loosely related tasks | 2017 | 27779
1709.02349v2 | A Deep Reinforcement Learning Chatbot | 2017 | 31468

FAISS uses integer indices internally. LabelEncoder maps string paper IDs (like 1802.00209v1) to integers (0 to 40,999).
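
The mapping is reversible, which matters later when FAISS hands back integers and we want the original IDs:

# Round-trip: encoded integer back to the original string ID
print(label_encoder.inverse_transform([36693]))
# ['1802.00209v1'] (matches the first row of the table above)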


Step 8: Build the FAISS Index

# Convert embeddings to float32 numpy array
embeddings_np = np.array(embeddings, dtype=np.float32)

def create_gpu_index(embedding_dim):
    num_gpus = faiss.get_num_gpus()
    print(f"Faiss detected {num_gpus} GPU(s)")

    if num_gpus == 0:
        print("No GPU available for Faiss, falling back to CPU index")
        index = faiss.IndexFlatL2(embedding_dim)
        return faiss.IndexIDMap(index)

    res = faiss.StandardGpuResources()
    config = faiss.GpuIndexFlatConfig()
    config.device = 0

    gpu_index = faiss.GpuIndexFlatL2(res, embedding_dim, config)
    return faiss.IndexIDMap(gpu_index)

gpu_index_map = create_gpu_index(embeddings_np.shape[1])

Output:

Faiss detected 1 GPU(s)

Add all embeddings with their encoded IDs:

gpu_index_map.add_with_ids(
    embeddings_np, df["encoded_id"][:length].values.astype("int64")
)

print(f"Number of embeddings in the Faiss index: {gpu_index_map.ntotal}")

Output:

Number of embeddings in the Faiss index: 41000

What each step does:

  1. GpuIndexFlatL2 — Creates an exact L2 distance search index on GPU 0
  2. IndexIDMap — Wraps the index to support custom IDs (encoded paper IDs)
  3. add_with_ids() — Inserts all 41,000 embeddings with their corresponding paper IDs
  4. CPU fallback — Automatically falls back to CPU if no GPU is detected
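
A quick way to confirm the index behaves as expected: a vector that is already stored should be its own nearest neighbour at (near-)zero distance.

# Sanity check: query with the first stored vector
dist, ids = gpu_index_map.search(embeddings_np[:1], k=1)
print(dist[0][0], ids[0][0])  # expect ~0.0 and df['encoded_id'].iloc[0]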

Step 9: Implement Search Helpers

Define a helper function to retrieve paper information from search results:

def id_to_info(df, I, column):
    # For each encoded ID returned by FAISS, pull the matching row's value(s)
    return [list(df[df['encoded_id'] == idx][column]) for idx in I]

This maps encoded IDs returned by FAISS back to paper metadata (titles, summaries) in the DataFrame.
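
For example, using the encoded IDs from the Step 7 table:

print(id_to_info(df, [36693, 18198], 'title'))
# [['Dual Recurrent Attention Units for Visual Question Answering'],
#  ['Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks']]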


Step 10: Test the Search

Search by natural language query:

Use the abstract from the "Attention Is All You Need" paper as a search query:

query = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data"

embed = model.encode([query])
print(f"embed shape: {embed.shape}")

Output:

embed shape: (1, 768)

Search the index for the 10 nearest neighbors:

D, I = gpu_index_map.search(embed.astype("float32"), k=10)

results = {
    'L2 distances': D.flatten().tolist(),
    'ML paper IDs': I.flatten().tolist(),
    'Titles': id_to_info(df, I.flatten(), 'title'),
    'Summaries': id_to_info(df, I.flatten(), 'summary')
}

pd.DataFrame(results).head(10)

Output:

L2 distance | Title
121.83 | Feature Representation for ICU Mortality
128.81 | A Model of the Mechanisms Underlying Exploratory Behaviour
129.73 | Gibbs Sampling in Open-Universe Stochastic Languages
130.12 | The Voynich Manuscript is Written in Natural Language: The Pahlavi Hypothesis
130.54 | Leveraging Unstructured Data to Detect Emerging Reliability Issues

Note: FAISS returns raw L2 distances, where lower values indicate closer matches. If you would rather rank by cosine similarity, normalize the embeddings with faiss.normalize_L2() and search with an inner-product index such as IndexFlatIP, as in the sketch below.
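
A minimal sketch of that cosine variant, shown with a CPU IndexFlatIP for brevity (GpuIndexFlatIP is the GPU analogue). Note that the model printout in Step 5 includes a Normalize() module, so freshly generated embeddings may already be unit-length; the explicit normalize_L2() call covers pre-computed embeddings saved without normalization:

# Cosine similarity = inner product over unit-length vectors
normed = embeddings_np.copy()
faiss.normalize_L2(normed)  # in-place L2 normalization

cos_index = faiss.IndexIDMap(faiss.IndexFlatIP(normed.shape[1]))
cos_index.add_with_ids(normed, df["encoded_id"].values.astype("int64"))

q = model.encode([query]).astype("float32")
faiss.normalize_L2(q)
scores, ids = cos_index.search(q, k=10)  # higher score = more similar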


Verification Checklist

  • faiss.get_num_gpus() returns ≥1
  • Embeddings shape is (41000, 768)
  • Index contains 41,000 vectors (gpu_index_map.ntotal == 41000)
  • Search returns papers with L2 distances

Performance Summary

Operation | Time
Embedding generation (41K papers) | ~40 seconds
Search query | ~50-100 ms
Load pre-computed embeddings | ~2 seconds

Troubleshooting

"No GPU available" (faiss.get_num_gpus() returns 0)

  • Verify CUDA installation: nvidia-smi
  • Reinstall faiss-gpu: pip uninstall faiss-gpu-cu12 && pip install faiss-gpu-cu12
  • Check PyTorch CUDA: torch.cuda.is_available()

"NumPy version incompatibility"

  • Pin NumPy: pip install "numpy<2.0"

"Out of memory" during embedding generation

  • Reduce batch size: model.encode(..., batch_size=32)
  • Use CPU fallback if GPU memory is limited

Next Steps

  • Scale up: Use IndexIVFFlat for datasets with millions of vectors (see the sketch after this list)
  • Add filtering: Combine semantic search with metadata filters (year, author)
  • Deploy: Wrap in FastAPI for a production-ready API
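
A rough sketch of the IndexIVFFlat upgrade mentioned above; the parameter values are illustrative, so tune nlist and nprobe for your dataset:

nlist = 256  # number of clusters (Voronoi cells) to partition the vectors into
quantizer = faiss.IndexFlatL2(embeddings_np.shape[1])
ivf_index = faiss.IndexIVFFlat(quantizer, embeddings_np.shape[1], nlist)

ivf_index.train(embeddings_np)  # IVF indexes must be trained before adding vectors
ivf_index.add(embeddings_np)

ivf_index.nprobe = 16  # clusters searched per query: higher = better recall, slower
D, I = ivf_index.search(embed.astype("float32"), k=10)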

Repository

Complete implementation: github.com/sheygs/semantic-search


Questions about semantic search? Feel free to reach out.