# Building a Semantic Search Engine with Transformers and FAISS

## What You'll Build
A semantic search engine that finds research papers by meaning, not keywords. Search for "attention mechanism in neural networks" and find relevant papers even if they don't contain those exact words.
By the end of this tutorial, you will:
- Generate 768-dimensional embeddings for 41,000 ML papers
- Build a GPU-accelerated similarity search index with FAISS (Facebook AI Similarity Search)
- Query papers in near real-time by semantic similarity
## Prerequisites

**Required:**
- Python 3.9+
- NVIDIA GPU with CUDA 12.4+ (for GPU acceleration)
- 8GB+ RAM
- Git LFS installed
**Knowledge assumed:**
- Basic Python and pandas
- Familiarity with machine learning concepts
## Tech Stack
| Component | Purpose |
|---|---|
| PyTorch + CUDA | GPU-accelerated deep learning |
| MPNet | Generates 768-dimensional text embeddings |
| FAISS-GPU | Fast similarity search across vectors |
| Pandas | Dataset loading and manipulation |
## Step 1: Clone and Set Up the Environment

```bash
git lfs install
git clone https://github.com/sheygs/semantic-search.git
cd semantic-search
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```

**Why Git LFS?** The dataset and pre-computed embeddings exceed GitHub's 100 MB file limit. Git LFS downloads these large files separately.

**Expected result:** Repository cloned with a `data/` folder containing `research_papers.json`.
## Step 2: Install Dependencies

```bash
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

Or install packages individually:

```bash
pip install "sentence-transformers>=5.0.0" "faiss-gpu-cu12>=1.7.2"
pip install "pandas>=2.3.0" "scikit-learn>=1.6.0" "numpy<2.0"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

| Package | Purpose |
|---|---|
| `sentence-transformers` | Converts text to semantic embeddings |
| `faiss-gpu-cu12` | GPU-accelerated similarity search (CUDA 12) |
| `numpy<2.0` | Pinned to avoid FAISS compatibility issues |
## Step 3: Import Libraries and Verify GPU

```python
import pickle
import pandas as pd
import torch
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing

# Verify GPU access
print(f"GPUs available: {faiss.get_num_gpus()}")
```

Output:

```
GPUs available: 1
```

**If GPU count is 0:** FAISS falls back to CPU, which is significantly slower. Check your CUDA installation.
## Step 4: Load the Dataset

```python
df = pd.read_json("data/research_papers.json")
df = df.drop(["author", "link", "tag"], axis=1)
print(f"Number of research papers: {len(df)}")
df.head()
```

Output:

```
Number of research papers: 41000
```
| id | title | year |
|---|---|---|
| 1802.00209v1 | Dual Recurrent Attention Units for Visual Question Answering | 2018 |
| 1603.03827v1 | Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks | 2016 |
| 1606.00776v2 | Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation | 2016 |
| 1705.08142v2 | Learning what to share between loosely related tasks | 2017 |
| 1709.02349v2 | A Deep Reinforcement Learning Chatbot | 2017 |
The dataset contains 41,000 ML research papers with `id`, `title`, `summary`, `year`, `month`, and `day` columns. We drop the metadata columns (`author`, `link`, `tag`) not needed for search.
## Step 5: Load the Embedding Model

```python
model = SentenceTransformer('all-mpnet-base-v2')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
print(f"model: {model}\n device: {device}")
```

Output:

```
model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True, ...})
  (2): Normalize()
)
device: cuda
```
**Model breakdown:**
- **MPNet (`all-mpnet-base-v2`)** — high-accuracy semantic embeddings. Uses a hybrid of masked and permuted language modeling for superior context awareness.
- **`all-MiniLM-L6-v2` (alternative)** — optimized for high-speed inference; retains roughly 95% of MPNet's retrieval quality at lower latency. Note that it produces 384-dimensional embeddings, so the FAISS index dimension must match if you swap it in.
- **Mean pooling** — aggregates token embeddings into a fixed-size 768-dimensional vector that represents the sentence's semantic meaning.
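Mean pooling itself is easy to illustrate. Here's a minimal numpy sketch using toy token vectors (not real MPNet outputs; a real implementation also masks padding tokens) showing how per-token embeddings become one fixed-size sentence vector:

```python
import numpy as np

# Toy stand-in for transformer output: 5 tokens, each a 768-dim vector
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 768)).astype(np.float32)

# Mean pooling: average over the token axis -> one sentence vector
sentence_embedding = token_embeddings.mean(axis=0)
print(sentence_embedding.shape)  # (768,)

# The model then L2-normalizes the pooled vector (the Normalize() module above)
normalized = sentence_embedding / np.linalg.norm(sentence_embedding)
print(np.linalg.norm(normalized))  # ~1.0
```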
## Step 6: Generate Embeddings

**Option A: Generate new embeddings (~40 seconds on GPU)**

```python
embeddings = model.encode(df.summary.to_list(), show_progress_bar=True)

# Save for future use
with open('data/new_embeddings.pickle', 'wb') as pkl:
    pickle.dump(embeddings, pkl)

print(f"Shape: {embeddings.shape}")
```

Output:

```
Shape: (41000, 768)
```
**Option B: Load pre-computed embeddings**

```python
def load_embeddings(file_path, mode='rb'):
    with open(file_path, mode) as f:
        embeddings = pickle.load(f)
    return embeddings, len(embeddings), embeddings.shape

embeddings, length, shape = load_embeddings('data/new_embeddings.pickle')
print(f"embeddings: {embeddings[0]}\n length: {length}\n shape: {shape}")
print(f"Is instance of numpy array: {isinstance(embeddings, np.ndarray)}")
```

Output:

```
embeddings: [-1.09981090e-01  1.64143533e-01  6.77780509e-01 ...]
 length: 41000
 shape: (41000, 768)
Is instance of numpy array: True
```
Each paper summary becomes a 768-dimensional vector. Similar concepts produce similar vectors, even with different wording. This is what makes semantic search possible — the model captures meaning, not just keywords.
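The comparison behind this is cosine similarity: vectors pointing in the same direction score near 1 regardless of magnitude. A standalone sketch with small toy vectors (stand-ins for real 768-dim embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins: two "paraphrase" vectors share a direction, one does not
paper_a = np.array([0.9, 0.1, 0.4])
paper_b = np.array([1.8, 0.2, 0.8])   # same direction, different magnitude
paper_c = np.array([-0.5, 0.9, -0.1])

print(cosine_similarity(paper_a, paper_b))  # ~1.0 (parallel vectors)
print(cosine_similarity(paper_a, paper_c))  # negative: dissimilar
```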
## Step 7: Prepare Data for FAISS

```python
label_encoder = preprocessing.LabelEncoder()
print(f"Data type before encoding: {df['id'].dtype}")
df['encoded_id'] = label_encoder.fit_transform(df['id'])
print(f"Data type after encoding: {df['encoded_id'].dtype}")
df.head()
```

Output:

```
Data type before encoding: object
Data type after encoding: int64
```
| id | title | year | encoded_id |
|---|---|---|---|
| 1802.00209v1 | Dual Recurrent Attention Units for Visual Question Answering | 2018 | 36693 |
| 1603.03827v1 | Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks | 2016 | 18198 |
| 1606.00776v2 | Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation | 2016 | 19318 |
| 1705.08142v2 | Learning what to share between loosely related tasks | 2017 | 27779 |
| 1709.02349v2 | A Deep Reinforcement Learning Chatbot | 2017 | 31468 |
FAISS uses integer indices internally. `LabelEncoder` maps the string paper IDs (like `1802.00209v1`) to integers (0 to 40,999).
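The mapping is reversible, which matters later: any integer ID that FAISS returns can be turned back into the original arXiv-style ID with `inverse_transform`. A standalone sanity check with a few sample IDs:

```python
from sklearn import preprocessing

ids = ["1802.00209v1", "1603.03827v1", "1706.03762v5"]
le = preprocessing.LabelEncoder()
encoded = le.fit_transform(ids)  # integer labels, assigned in sorted ID order

# Round-trip: integer labels map back to the original string IDs
restored = le.inverse_transform(encoded)
print(list(restored) == ids)  # True
```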
Step 8: Build the FAISS Index
# Convert embeddings to float32 numpy array
embeddings_np = np.array(embeddings, dtype=np.float32)
def create_gpu_index(embedding_dim):
num_gpus = faiss.get_num_gpus()
print(f"Faiss detected {num_gpus} GPU(s)")
if num_gpus == 0:
print("No GPU available for Faiss, falling back to CPU index")
index = faiss.IndexFlatL2(embedding_dim)
return faiss.IndexIDMap(index)
res = faiss.StandardGpuResources()
config = faiss.GpuIndexFlatConfig()
config.device = 0
gpu_index = faiss.GpuIndexFlatL2(res, embedding_dim, config)
return faiss.IndexIDMap(gpu_index)
gpu_index_map = create_gpu_index(embeddings_np.shape[1])
Output:
Faiss detected 1 GPU(s)
Add all embeddings with their encoded IDs:

```python
gpu_index_map.add_with_ids(
    embeddings_np, df["encoded_id"][:length].values.astype("int64")
)
print(f"Number of embeddings in the Faiss index: {gpu_index_map.ntotal}")
```

Output:

```
Number of embeddings in the Faiss index: 41000
```
**What each step does:**
- `GpuIndexFlatL2` — creates an exact L2-distance search index on GPU 0
- `IndexIDMap` — wraps the index to support custom IDs (the encoded paper IDs)
- `add_with_ids()` — inserts all 41,000 embeddings with their corresponding paper IDs
- **CPU fallback** — automatically falls back to a CPU `IndexFlatL2` if no GPU is detected
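A flat L2 index does an exact brute-force scan: it computes the distance from the query to every stored vector and keeps the k smallest (note that FAISS's flat L2 indexes report *squared* L2 distances). A minimal numpy equivalent with toy data makes the semantics concrete:

```python
import numpy as np

rng = np.random.default_rng(42)
database = rng.normal(size=(1000, 8)).astype(np.float32)  # 1000 vectors, dim 8
query = database[7] + 0.001  # a near-duplicate of vector 7

# Squared L2 distance from the query to every database vector
dists = ((database - query) ** 2).sum(axis=1)

# Indices of the k smallest distances, ascending — analogous to search() output
k = 3
I = np.argsort(dists)[:k]
D = dists[I]
print(I[0])  # 7 — the near-duplicate is the closest match
```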
## Step 9: Implement Search Helpers

Define a helper function to retrieve paper information from search results:

```python
def id_to_info(df, I, column):
    return [list(df[df['encoded_id'] == idx][column]) for idx in I]
```

This maps the encoded IDs returned by FAISS back to paper metadata (titles, summaries) in the DataFrame.
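Note that this comprehension scans the whole DataFrame once per result ID. A possible alternative (a sketch with a hypothetical `id_to_info_fast` helper and toy data): set `encoded_id` as the DataFrame index once, then fetch all result IDs in a single `.loc` call:

```python
import pandas as pd

# Toy frame standing in for the papers DataFrame
papers = pd.DataFrame({
    "encoded_id": [0, 1, 2, 3],
    "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
})

lookup = papers.set_index("encoded_id")  # build once, reuse for every query

def id_to_info_fast(I, column):
    # One vectorized label lookup instead of one full scan per ID
    return lookup.loc[list(I), column].tolist()

print(id_to_info_fast([2, 0], "title"))  # ['Paper C', 'Paper A']
```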
## Step 10: Test the Search

**Search by natural-language query.** Use the abstract from the "Attention Is All You Need" paper as a search query:

```python
query = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data"
embed = model.encode([query])
print(f"embed shape: {embed.shape}")
```

Output:

```
embed shape: (1, 768)
```
Search the index for the 10 nearest neighbors:

```python
D, I = gpu_index_map.search(embed.astype("float32"), k=10)

results = {
    'L2 distances': D.flatten().tolist(),
    'ML paper IDs': I.flatten().tolist(),
    'Titles': id_to_info(df, I.flatten(), 'title'),
    'Summaries': id_to_info(df, I.flatten(), 'summary')
}
pd.DataFrame(results).head(10)
```
Output (top 5 of 10 shown):
| L2 distance | Title |
|---|---|
| 121.83 | Feature Representation for ICU Mortality |
| 128.81 | A Model of the Mechanisms Underlying Exploratory Behaviour |
| 129.73 | Gibbs Sampling in Open-Universe Stochastic Languages |
| 130.12 | The Voynich Manuscript is Written in Natural Language: The Pahlavi Hypothesis |
| 130.54 | Leveraging Unstructured Data to Detect Emerging Reliability Issues |
**Note:** Raw L2 distances are not normalized here, so lower values indicate closer matches. For production use, consider normalizing embeddings with `faiss.normalize_L2()` before indexing, which makes ranking by L2 distance equivalent to ranking by cosine similarity.
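The relationship behind that tip: for unit-length vectors, squared L2 distance and cosine similarity are tied by the identity ‖a − b‖² = 2 − 2·cos(a, b), so the smallest L2 distance always corresponds to the largest cosine similarity. A quick numpy check with random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 768))
a /= np.linalg.norm(a)   # unit-normalize, as faiss.normalize_L2 does in place
b /= np.linalg.norm(b)

l2_sq = np.sum((a - b) ** 2)
cos = np.dot(a, b)
print(abs(l2_sq - (2 - 2 * cos)) < 1e-9)  # True: the identity holds
```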
## Verification Checklist

- [ ] `faiss.get_num_gpus()` returns ≥ 1
- [ ] Embeddings shape is `(41000, 768)`
- [ ] The index contains 41,000 vectors (`gpu_index_map.ntotal == 41000`)
- [ ] Search returns papers with L2 distances
## Performance Summary
| Operation | Time |
|---|---|
| Embedding generation (41K papers) | ~40 seconds |
| Search query | ~50-100ms |
| Load pre-computed embeddings | ~2 seconds |
## Troubleshooting

### "No GPU available" (`faiss.get_num_gpus()` returns 0)
- Verify the CUDA installation: `nvidia-smi`
- Reinstall faiss-gpu: `pip uninstall faiss-gpu-cu12 && pip install faiss-gpu-cu12`
- Check PyTorch CUDA: `torch.cuda.is_available()`

### "NumPy version incompatibility"
- Pin NumPy: `pip install "numpy<2.0"`

### "Out of memory" during embedding generation
- Reduce the batch size: `model.encode(..., batch_size=32)`
- Use the CPU fallback if GPU memory is limited
## Next Steps

- **Scale up:** use `IndexIVFFlat` for datasets with millions of vectors
- **Add filtering:** combine semantic search with metadata filters (year, author)
- **Deploy:** wrap the search in FastAPI for a production-ready API
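On the scaling point: `IndexIVFFlat` avoids scanning all vectors by clustering them (a coarse quantizer) and probing only the `nprobe` closest clusters at query time. The toy numpy sketch below illustrates that idea only, not the FAISS API; the centroids here are just randomly chosen database vectors rather than trained by k-means:

```python
import numpy as np

rng = np.random.default_rng(3)
database = rng.normal(size=(200, 4)).astype(np.float32)

# "Train" a coarse quantizer: pick 4 database vectors as cluster centroids
centroids = database[rng.choice(200, size=4, replace=False)]

# Assign every vector to its nearest centroid (the inverted lists)
assign = np.argmin(((database[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

def ivf_search(query, nprobe=1):
    # Probe only the nprobe closest clusters, then scan just those vectors
    cd = ((centroids - query) ** 2).sum(-1)
    probed = np.argsort(cd)[:nprobe]
    candidates = np.where(np.isin(assign, probed))[0]
    dists = ((database[candidates] - query) ** 2).sum(-1)
    return candidates[np.argmin(dists)]

print(ivf_search(database[50], nprobe=1))  # 50: found in its own cluster
```

The trade-off is recall versus speed: with small `nprobe`, a true nearest neighbor sitting in an unprobed cluster can be missed, which is why FAISS exposes `nprobe` as a tunable parameter.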
## Repository

Complete implementation: [github.com/sheygs/semantic-search](https://github.com/sheygs/semantic-search)
Questions about semantic search? Feel free to reach out.