Explanation of Chunk Ensembling
 
Chunk Ensembling is a retrieval optimization technique that balances precision and context by retrieving multiple chunk sizes simultaneously and re-ranking them for relevance. Instead of relying on only one chunk size, this approach ensures that the system retrieves both small, highly precise chunks and larger, context-rich chunks, providing a more comprehensive retrieval experience.
How Chunk Ensembling Works
- Multi-Scale Chunk Indexing – The document is indexed in multiple ways (see the sketch after this list):
  - Small chunks (128-256 tokens) for precise matching.
  - Medium chunks (512-1024 tokens) for sentence- and paragraph-level context.
  - Large chunks (2000+ tokens) for broader document understanding.
- Parallel Retrieval – When a query is made, the retrieval system fetches multiple chunk sizes simultaneously from a vector database (e.g., FAISS, Pinecone, Weaviate), ensuring that both detailed fact-level and contextually relevant information is retrieved.
- Re-Ranking the Results – Once different-sized chunks are retrieved, they are scored based on:
  - Semantic similarity to the query.
  - Context completeness (whether enough supporting details exist).
  - Query intent alignment (whether the chunk directly answers the user's need).
  The best chunk (or combination of chunks) is then selected for final retrieval.
- Dynamic Merging of Chunks (If Needed) – If small chunks alone lack context, the system dynamically merges them into a coherent passage before passing the final result to the LLM.
 
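To make the multi-scale indexing step concrete, here is a minimal sketch of splitting one document into several chunk sizes before embedding. The `split_into_chunks` helper and the use of word counts as a stand-in for token counts are illustrative assumptions, not part of FAISS or any specific library.

# Minimal sketch of multi-scale chunk indexing (hypothetical helper; word counts
# approximate token counts for simplicity).
def split_into_chunks(text, chunk_size):
    """Split text into chunks of roughly `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

document_text = "Artificial intelligence is reshaping healthcare, finance, and manufacturing. " * 50

# Index the same document at three granularities
multi_scale_chunks = {
    "small": split_into_chunks(document_text, 128),   # precise matching
    "medium": split_into_chunks(document_text, 512),  # paragraph-level context
    "large": split_into_chunks(document_text, 2000),  # broad document understanding
}

for scale, chunks in multi_scale_chunks.items():
    print(f"{scale}: {len(chunks)} chunk(s)")

Each scale would then be embedded and stored in its own index (or tagged within a single index) so that retrieval can pull candidates from every granularity at query time.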
Why Use Chunk Ensembling?
- Improves Accuracy – Ensures that retrieval includes both precise answers and full context.
- Reduces Hallucination – By merging chunks dynamically (a simple merging sketch follows this list), it prevents the model from making assumptions.
- Optimizes LLM Input – Sends the most relevant data into the LLM, reducing token waste.
- Enhances User Experience – Responses become more informative and easier to comprehend.
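As one possible way to implement the dynamic merging mentioned above, the sketch below concatenates retrieved small chunks until a rough length budget is reached. The `merge_small_chunks` helper and the word-based budget are assumptions for illustration; a production system would track real token counts.

# Illustrative sketch of dynamic chunk merging (hypothetical helper; lengths are
# approximated by word count rather than true token count).
def merge_small_chunks(chunks, max_words=300):
    """Merge retrieved small chunks into one coherent passage, up to a word budget."""
    merged, used = [], 0
    for chunk in chunks:
        n_words = len(chunk.split())
        if used + n_words > max_words:
            break  # stop before exceeding the context budget
        merged.append(chunk)
        used += n_words
    return " ".join(merged)

retrieved_small_chunks = [
    "AI is transforming industries worldwide.",
    "Artificial intelligence is being used in healthcare, finance, and technology sectors.",
]
print(merge_small_chunks(retrieved_small_chunks))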
Here's a Python implementation of Chunk Ensembling, demonstrating how to retrieve multiple chunk sizes and re-rank them for the best result. This example assumes the use of FAISS (Facebook AI Similarity Search) for vector storage and BM25 (text-based retrieval) for keyword search.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
# Load a sentence embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Example document chunks (simulating multiple chunk sizes)
documents = [
    "AI is transforming industries worldwide.",  # Small chunk (128 tokens)
    "Artificial intelligence is being used in healthcare, finance, and technology sectors to improve efficiency and decision-making.",  # Medium chunk (512 tokens)
    "Over the past decade, machine learning and deep learning models have been widely adopted in various industries, offering unprecedented levels of automation and insights into data-driven decision-making processes.",  # Large chunk (1024 tokens)
]
# Compute embeddings for each chunk (FAISS expects float32 vectors)
chunk_embeddings = np.array([embedding_model.encode(doc) for doc in documents]).astype("float32")
# Create a FAISS index (for vector search)
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # L2 distance-based search
index.add(chunk_embeddings)  # Store vectors
# BM25 keyword search setup
tokenized_docs = [doc.split(" ") for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
def ensemble_retrieval(query, k=2):
    """
    Retrieves relevant chunks using both FAISS (vector search) and BM25 (keyword-based search).
    Re-ranks them based on combined, normalized scores.
    """
    # Compute the query embedding as a 2-D float32 array, as FAISS expects
    query_embedding = embedding_model.encode(query).reshape(1, -1).astype("float32")

    # FAISS vector search over all chunks (L2 distance: lower = more similar)
    distances, indices = index.search(query_embedding, len(documents))

    # Convert distances to similarity scores in document order, then scale to 0-1
    vector_scores = np.empty(len(documents))
    vector_scores[indices[0]] = -distances[0]  # negate so higher is better
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min() + 1e-5)

    # BM25 keyword search, scaled to 0-1 so the two signals are comparable
    bm25_scores = np.array(bm25.get_scores(query.split()))
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-5)

    # Aggregate scores (weighted sum of FAISS & BM25; adjust weighting as needed)
    combined_scores = 0.5 * vector_scores + 0.5 * bm25_scores

    # Sort documents by best combined score and return the top-k
    ranked_results = sorted(zip(combined_scores, documents), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked_results[:k]]
# Example Query
query_text = "How is AI transforming industries?"
results = ensemble_retrieval(query_text, k=3)
# Display results
print("\n🔹 Top Retrieved Chunks:")
for i, res in enumerate(results):
    print(f"{i+1}. {res}")
Explanation of the Code
- Multi-Scale Chunking
  - We store small (128-token), medium (512-token), and large (1024-token) chunks in a vector database (FAISS).
  - BM25 is used for exact keyword search to complement semantic retrieval.
- Dual Retrieval Mechanism
  - FAISS Vector Search: finds the closest semantic matches to the query.
  - BM25 Keyword Search: identifies exact word matches for relevance.
- Re-Ranking Strategy (a small worked example of the scoring follows this list)
  - FAISS distances are converted to similarity scores and normalized to a 0-1 range.
  - BM25 scores are normalized to the same range.
  - The final ranking selects the most relevant chunk(s) using a weighted combined score.
 
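To make the re-ranking arithmetic concrete, here is a tiny worked example of the weighted combination used in ensemble_retrieval; the individual score values are made up for illustration.

# Worked example of the weighted score combination (score values are illustrative).
vector_score = 0.82  # normalized semantic similarity (0-1)
bm25_score = 0.40    # normalized keyword relevance (0-1)

# Equal weighting of the semantic and keyword signals, as in ensemble_retrieval
combined = 0.5 * vector_score + 0.5 * bm25_score
print(round(combined, 2))  # 0.61

# Shifting weight toward the semantic signal favors paraphrased matches
combined_semantic_heavy = 0.7 * vector_score + 0.3 * bm25_score
print(round(combined_semantic_heavy, 2))  # 0.69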
Example Output
Running the script prints the retrieved chunks ranked by their combined score under the "🔹 Top Retrieved Chunks:" header, with the highest-scoring chunk listed first.