# Building a Retrieval System for a RAG application
[Jack Munday](jack.charlie.munday@cern.ch)

In this exercise, we will work through building a basic retrieval system for a RAG application using the [langchain framework](https://www.langchain.com/). 

Langchain is a framework that makes it easier to build LLM-enabled applications. It offers a suite of tools for handling common tasks when working with LLMs, including:

* Document Processing: Loading, Splitting, Retrieval & Storage
* Embedding & Vector Stores: Embedding, Similarity Search
* Pipelines: LLM Chaining, Prompting Templates, RAG, Agentic Workflows
* Memory: Conversation, Long-term Storage, Context management
* and much more...

Please note that there are many frameworks similar to `langchain`. I have chosen it for this exercise because it is one of the most popular and has good documentation. This makes it a good starting point for learning, but using it (or a similar tool) is by no means a requirement for building a RAG system. Such frameworks allow for rapid prototyping: by abstracting away these tasks, they make it very easy to get started without worrying about specific implementation details.

That said, as development progresses towards a production-grade RAG application, it is common to move to custom implementations tailored to your use case, e.g. custom parsing and splitting for your documents.

## Installing Dependencies

If you have issues running the installation, please make sure that you started your SWAN session with the flag `"Use python packages installed on CERNBox"` enabled.

In [None]:
! pip install langchain langchain-community langchain-huggingface pypdf nbconvert langchain-chroma faiss-cpu --user -q

# Parsing Documents

As you would expect, `langchain` has a rich set of functionality for processing documents. At the core of this is the `Document` Python type, which comprises a piece of text and some metadata about that text.

*Please note the slightly confusing nomenclature: when a raw document (e.g. news article, contract, academic paper) is loaded and parsed by `langchain`, it will generate many `Document`s (one `Document` per page in your document). To avoid confusion, when referring to what you would typically think of as a document I use standard formatting, while the langchain `Document` type is written in a mono-spaced font.*

As discussed in the presentation, simply storing the text of a document is not enough to support more advanced RAG use cases. Typically we want to store metadata about each `Document` so that retrieved chunks can later be evaluated for relevance before being sent to the LLM for generation. `langchain` offers support for populating standard pieces of metadata that you may want to attach to each chunk, e.g. master document, page_number, chunk_position etc.

In [None]:
# ref: https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
documents[1]

We will begin by investigating the impact of different chunking strategies on a given document to see how this changes the size of each chunk and the number of chunks generated per document.

Getting the right chunk size is typically a compromise between context and noise.
* Larger chunks contain more (hopefully useful) context for your LLM downstream, but typically also contain more noise (information unrelated to the question or query at hand).
* Smaller chunks contain less context, but typically also less noise.

There is no single correct approach to chunking your documents or selecting a chunk size. Your choice of strategy will typically depend on the types of documents you want to query, how they are formatted (e.g. html, academic papers, news articles), how the chunks will be used downstream, and the implementation details of your RAG pipeline.

## Uploading a Document to SWAN

Upload a sample PDF to SWAN on which you can test the parsing and chunking strategies of `langchain`. The choice of document really does not matter too much for the purpose of this exercise, but if you would rather not choose one yourself, I have attached the academic paper "Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning". This should be available at the file path `"./jet-substructure.pdf"` from inside this notebook.

In [None]:
# --- exercise
# 1. Choose a document to upload to SWAN (or use the provided academic paper).
# 2. Use one of the `langchain` document loaders to parse your document into a series of `Documents`.
# 3. Does the number of `Documents` generated match what you would expect?
# 4. Is there any metadata missing from each `Document` that could be useful downstream for RAG?
# ---

If you need a hint, see the `langchain` guide to loading PDFs: https://python.langchain.com/docs/how_to/document_loader_pdf/

## Splitting your Document into Chunks

Whilst breaking down your document into page-sized chunks will have significantly reduced your chunk size, each chunk is still too large to fit into a single embedding.

As a reminder, the typical optimum chunk size for embedding is roughly paragraph-sized, and the maximum input length for most embedding models is typically ~512 tokens. We would therefore like to break the pages down a little further to get closer to these values.

During the presentation we discussed a number of chunking strategies:
* Fixed Token Window
* Sentence Level Chunking
* "Document Aware"
* Semantic Chunking

We will now have a go at using each of these to break down our pages.

See the [langchain_text_splitters](https://python.langchain.com/api_reference/text_splitters/index.html) library for more information on getting started.

### Fixed token window
Let's start with the simplest: implementing a fixed token window.
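
To make the idea concrete before you reach for a `langchain` splitter, here is a minimal pure-Python sketch of a fixed window with overlap. It assumes whitespace-separated words as a stand-in for real model tokens, and the names `fixed_window_chunks`, `chunk_size` and `overlap` are illustrative choices of mine rather than part of any library. It also records a `chunk_index` for each chunk.

```python
def fixed_window_chunks(text, chunk_size=8, overlap=2):
    """Split `text` into windows of `chunk_size` whitespace tokens.

    Neighbouring windows share `overlap` tokens. Whitespace tokens are only
    an approximation of the tokens an embedding model actually sees.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append({"chunk_index": index, "text": " ".join(window)})
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


example_chunks = fixed_window_chunks("a b c d e f g h i j", chunk_size=4, overlap=1)
for chunk in example_chunks:
    print(chunk)
```

The overlap means a sentence cut at a window boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.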

In [None]:
# --- exercise
# 1. Choose a text splitter to split the document you uploaded in the prior task.
# 2. Use this splitter to generate chunks (these are the chunks that we will embed and then use in the retrieval stage later).
# 3. Confirm that your chunk(s) are the appropriate size for your selected chunk size.
# 4. Are you appropriately tracking the chunk_index? This will be useful if you decide to use Small2Big Retrieval which requires 
#    understanding of locally near chunks to build a bigger context window around a smaller chunk.
# ---

### Sentence Chunking

Let's compare this approach with splitting at the sentence level. We can naively achieve this using a character splitter on `"."`. Naturally this simplification has limitations; however, `langchain` merges neighbouring pieces into chunks of up to `chunk_size` characters, which avoids a lot of the common problems.

e.g. `J. C. Munday.` being parsed into 3 chunks: `J.`, `C.` and `Munday.`
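
The following pure-Python sketch illustrates both the problem and one way of softening it. It is not `langchain`'s implementation: `naive_sentence_split`, `merge_pieces` and `max_chars` are hypothetical names, and the merging rule (greedily join neighbours while the result fits in `max_chars`) is an assumption chosen for simplicity.

```python
def naive_sentence_split(text):
    """Split on '.' and re-attach the full stop to each non-empty piece."""
    return [piece.strip() + "." for piece in text.split(".") if piece.strip()]


def merge_pieces(pieces, max_chars=30):
    """Greedily join neighbouring pieces while the result fits in `max_chars`."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > max_chars:
            # Adding this piece would overflow the budget: close the chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "J. C. Munday. Dogs are loyal. Cats are independent."
pieces = naive_sentence_split(text)
print(pieces)   # ['J.', 'C.', 'Munday.', 'Dogs are loyal.', 'Cats are independent.']
merged = merge_pieces(pieces, max_chars=30)
print(merged)   # ['J. C. Munday. Dogs are loyal.', 'Cats are independent.']
```

Merging stops the initials from becoming chunks of their own, but the split points are still decided by `"."` alone, so edge cases remain.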

In [None]:
# --- exercise
# 1. Use a character splitter to generate chunks on the sentence level for the same document. Choose an appropriate chunk size.
# 2. How does the number of chunks and average chunk size compare to that of the fixed token window?
# 3. Inspect a random sample of the chunks. Do you observe any edge cases where our simplification does not work?
# ---

### Document Aware Splitting

Document-structure-based splitting takes advantage of the structural markers already present in a document to build chunks, e.g. `</p>` tags in `html` documents.

`langchain` also offers support for this. Please consult [here](https://python.langchain.com/docs/concepts/text_splitters/#document-structured-based) to confirm whether your choice of uploaded document is supported by one of their custom parsers.
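
As an illustration of the principle, here is a small sketch that splits a markdown string into one chunk per heading, keeping the heading as metadata. It is a simplification under stated assumptions: `split_markdown_by_heading` is a hypothetical helper of mine, and it treats any line starting with `#` as a heading, so it would misfire on `#` lines inside code blocks.

```python
def split_markdown_by_heading(markdown_text):
    """Return one section per markdown heading, with the heading kept as metadata."""
    sections = []
    heading, body = None, []

    def flush():
        # Only emit a section if we have seen a heading or some body text
        if heading is not None or any(line.strip() for line in body):
            sections.append({"heading": heading, "text": "\n".join(body).strip()})

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections


sample = "# Intro\nHello.\n\n## Setup\nInstall it.\nRun it."
sections = split_markdown_by_heading(sample)
for section in sections:
    print(section)
```

Because each chunk follows the author's own section boundaries, the heading can be stored alongside the text and used later when judging relevance.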

In [None]:
# --- exercise
# 1. Use a document-structure based parser to generate chunks for your document.
#    (Please note this is only possible if your document is a structured file i.e. html, markdown, json etc.)
# 2. How does the number of chunks and average chunk size compare to that of the fixed token window and sentence level chunking?
# ---

### Semantic Splitting

`langchain` offers support for semantic splitting, which builds chunks from neighbouring sentences that discuss similar topics.

Before attempting this exercise, please go through the below exercises on embedding documents with `langchain` as using an embedding model will be required to progress.

For more information on getting started please see [here](https://python.langchain.com/docs/how_to/semantic-chunker/).
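
The core idea can be sketched in a few lines: start a new chunk whenever the similarity between one sentence and the next drops below a threshold. This is a simplified illustration and not the library's exact algorithm. The function name `semantic_chunks`, the fixed `threshold`, and the hand-written two-dimensional vectors are all assumptions made for the example; in practice the vectors would come from your embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_chunks(sentences, vectors, threshold=0.5):
    """Group consecutive sentences, breaking where adjacent similarity < threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(previous, current) < threshold:
            groups.append([sentence])      # topic shift: start a new chunk
        else:
            groups[-1].append(sentence)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


# Placeholder vectors, NOT real model output: the first two sentences point
# one way, the last two another.
toy_sentences = ["A.", "B.", "C.", "D."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = semantic_chunks(toy_sentences, toy_vectors)
print(toy_chunks)   # ['A. B.', 'C. D.']
```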

In [None]:
# --- exercise
# 1. Use a semantic-splitting based parser to generate chunks for your document.
# 2. How does the number of chunks and average chunk size compare to that of the prior used methods?
# ---

Choosing the optimum chunking strategy can not be done in isolation; it should be evaluated as a component of your whole retrieval system. So, to get started with building the rest, let's look into embedding documents.

## Embedding Documents

`langchain` offers support for a number of different providers for embedding. At its heart this is either:
* abstracting an API request to an externally hosted API.
* loading a model into memory and passing your text through it.

Writing a service for generating embeddings is fairly easy due to the wide availability of pre-trained models for embedding. Development often simply requires wrapping them inside of an API so the service can be scaled independently. 

For the purpose of this exercise I recommend `HuggingFaceEmbeddings` for its simplicity. However, if you already subscribe to an LLM provider (GPT-x, Claude, Gemini etc.), please feel free to use its service to generate your embeddings (you will need to pass in an `API_KEY` to authenticate yourself). Bear in mind that these are paid services, whilst access to (most) Hugging Face models is free.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

Browse https://huggingface.co/models and choose a model to generate your embeddings. You will want a model that has been trained for "Sentence Similarity" or "Semantic Search": these are models specifically for generating embeddings.

In [None]:
# --- exercise
# 1. Create your embedder. (I recommend you do this in its own cell as the process of downloading and loading to
# memory can be a bit slow and you do not want to re-instantiate the model with every request to generate embeddings).
# NOTE: If using HuggingFaceEmbeddings you may get a cvmfs warning related to `resume_download` being deprecated, please feel
# free to ignore this one.
# ---

In [None]:
# --- exercise
# 2. Use this embedding model to generate an embedding of a sample query.
# 3. Inspect the embedding and confirm that its dimension matches that listed in your model documentation.
# ---

### Generating a similarity matrix

We will now build a similarity matrix to compare a series of sentences: this visualises how similar all of the `sample_sentences` are to each other. (Naturally, with cosine similarity or normalised embeddings, the diagonal of this matrix will be all `1`s.)

As a quick reminder, given vectors $\vec{A}$ and $\vec{B}$, their similarity can be calculated using their dot product ($S = \vec{A} \cdot \vec{B}$) or cosine similarity ($S = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}||\vec{B}|}$), where a larger value of $S$ corresponds to a more similar pair of vectors.

Your choice of similarity metric will depend on:
1) Whether you have decided to normalise embeddings.
2) Whether your embedding model has been fine tuned for a given metric.
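
The relationship between the two metrics is easy to check numerically. The sketch below uses plain Python lists and two made-up vectors (not real embeddings) to show that the dot product depends on vector length, whereas cosine similarity does not, and that the two agree once the vectors are normalised.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalise(a):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as `a`, twice the length

raw_dot = dot(a, b)                               # grows with vector length
cos = cosine_similarity(a, b)                     # depends on direction only
normalised_dot = dot(normalise(a), normalise(b))  # matches the cosine similarity

print(raw_dot, cos, normalised_dot)
```

This is why the choice matters less once embeddings are normalised: the dot product of unit vectors is their cosine similarity.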

In [None]:
sample_sentences = [
    "CERN uses the Large Hadron Collider, the world's most powerful particle accelerator, to study fundamental physics.",
    "CERN's Large Hadron Collider was instrumental in the discovery of the Higgs boson in 2012.",
    "The world record time for running 26.2 miles is 2:00:35 and is held by Kelvin Kiptum.",
    "Usain Bolt's 100m world record was set in the 2009 Berlin World Championships."
]

In [None]:
# --- exercise
# 1. Generate embeddings for each of the sample_sentences.
# 2. Choose a similarity metric according to your model and whether you have decided to normalise your embeddings.
# 3. Build a similarity matrix which can be used as a lookup table to determine the similarity of `S_1` against `S_2`.
# 4. Create a heatmap style plot of this matrix which can be used to visually see the "closeness" of two sentences.
# ---

### Querying to find the most similar vector

We will now make use of these embeddings and similarity metrics to retrieve relevant information based on a query.

We would like to answer the question `"who is the fastest marathon runner?"` based on the sample sentences above. Clearly a single sentence in our sample sentences is highly relevant to this query, whilst the others do not pertain to it.

Generate a similarity score between this query and each of the sentences in `sample_sentences`. You do not need any of the fancy storage methods offered by `langchain`; it is absolutely fine to store the scores in memory using a plain Python `dict`. From these scores, programmatically identify the sentence most relevant to the query.
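
The ranking step itself needs nothing more than a dictionary and a sort. In the sketch below the vectors are hand-written placeholders, not the output of any embedding model, and the short labels stand in for the full sentences; `rank_by_similarity` is a hypothetical helper name. Swap in the real embeddings from your model to complete the exercise.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_by_similarity(query_vector, embeddings):
    """Return (text, score) pairs sorted from most to least similar."""
    scores = {text: cosine_similarity(query_vector, vector)
              for text, vector in embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder 2-d vectors: first axis ~ "particle physics", second ~ "running".
toy_embeddings = {
    "LHC sentence": [0.9, 0.1],
    "Higgs sentence": [0.8, 0.2],
    "marathon sentence": [0.1, 0.9],
    "100m sentence": [0.3, 0.7],
}
toy_query = [0.0, 1.0]   # stand-in for the embedded marathon question

ranking = rank_by_similarity(toy_query, toy_embeddings)
best_text, best_score = ranking[0]
print(best_text, round(best_score, 3))
```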

In [None]:
# --- exercise
# 1. Generate a similarity score between the question and each of the sentences in `sample_sentences`.
# 2. Use this similarity score to rank each of the sentences for relevance against the query.
# ---

While I appreciate this example is very basic, I hope it helps to illustrate the power of using embeddings for search. 

The embedding model has correctly recognised that a marathon is 26.2 miles and not 100m, as shown by the higher similarity score (a proxy for relevance). Sentences referring to operations at CERN are, predictably, given a low similarity to topics related to running.

## Working with Vector Databases

In the approach above we stored all of our embeddings in memory, which clearly does not scale for a production system. There are several problems with this:

* As we store more and more documents we will reach a point where their embeddings can not all be stored in RAM.
* Search performance is limited as there is no easy way for us to parallelise searching.
* We are manually managing all of our data structures for storage; unless you are a DSA expert, this is likely to result in sub-optimal performance compared with using an off-the-shelf component.

Configuring a production vector database clearly extends beyond the scope of this exercise. (Please refer back to the slides for details on the challenges of running a production vector database.) With this in mind, we will continue to keep our vector store inside this notebook: the number of documents we are storing is low, so this should continue to work reasonably well (and it massively simplifies the process).

As with everything else so far, `langchain` offers APIs for working with vector stores. As we are not using an external database, we have a choice between FAISS (in-memory database) and ChromaDB.
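
Before using one of these, it can help to see what a vector store does conceptually: it keeps each text with its embedding and metadata, and returns the top `k` matches for a query. The toy class below is a sketch under stated assumptions, not how FAISS or ChromaDB work internally. `TinyVectorStore`, its method names, and `toy_embed` (which just counts two keywords) are all hypothetical, and the search is a brute-force scan over every stored vector.

```python
import math


def _cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class TinyVectorStore:
    """Toy in-memory store: embeds on insert, brute-force search on query."""

    def __init__(self, embed):
        self.embed = embed   # callable: str -> list[float]
        self.rows = []       # (vector, text, metadata) tuples

    def add_texts(self, texts, metadatas=None):
        metadatas = metadatas or [{} for _ in texts]
        for text, metadata in zip(texts, metadatas):
            self.rows.append((self.embed(text), text, metadata))

    def similarity_search(self, query, k=1):
        query_vector = self.embed(query)
        scored = [(_cosine(query_vector, vector), text, metadata)
                  for vector, text, metadata in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


def toy_embed(text):
    """Placeholder 'embedding': counts of two keywords. Not a real model."""
    lowered = text.lower()
    return [float(lowered.count("collider")), float(lowered.count("record"))]


store = TinyVectorStore(toy_embed)
store.add_texts(
    ["The collider studies physics.", "A marathon record was set."],
    metadatas=[{"source": "physics"}, {"source": "sports"}],
)
results = store.similarity_search("Which record is fastest?", k=1)
print(results)
```

Keeping the metadata next to each vector is what later lets you trace a retrieved chunk back to its source document.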

In [None]:
##! pip install faiss-cpu --user -q

In [None]:
# --- exercise
# 1. Initialise an in-memory vector store using langchain.
# 2. Load documents of your choosing into the vector store.
#    I'd recommend starting with the sample_sentences as this will 
#    allow you to get round some of the kinks with using the APIs and
#    we can then use the same code to process the document loaded at 
#    the start of the exercise.
# 3. Perform a similarity search on your documents.
# ---

# Generation

Access to a production-grade LLM is not the easiest to come by: you either need GPUs to run your own inference service or a credit card to pay someone else to do this for you... unfortunately I have neither of these to hand.

We will now attempt to use Hugging Face models in the generation stage as well, but please be patient and bear in mind that even a very basic request may be very slow. I similarly recommend that you use the sample_sentences for this exercise, but please feel free to add your own.

Please also bear in mind that we will typically only be able to run a tiny model in memory, and the quality of the output will reflect this. We are also implementing only the simplest form of RAG, so we would expect a large number of hallucinations.
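
To show the shape of what gets sent to the model, here is a minimal prompt built with plain Python string formatting. The wording of the template, the placeholder names (`system_prompt`, `context`, `question`) and the `build_prompt` helper are illustrative assumptions; a `langchain` prompt template plays the same role with more features.

```python
PROMPT_TEMPLATE = """{system_prompt}

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(system_prompt, chunks, question):
    """Fill the template, numbering each retrieved chunk for easy reference."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(
        system_prompt=system_prompt, context=context, question=question
    )


prompt_text = build_prompt(
    system_prompt="Answer using only the context provided.",
    chunks=["first retrieved chunk", "second retrieved chunk"],
    question="who is the fastest marathon runner?",
)
print(prompt_text)
```

Telling the model to rely only on the supplied context is a common way to try to limit answers that are not grounded in the retrieved chunks, although a small model may still ignore it.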

In [None]:
# --- exercise
# 1. Construct a prompt_template containing the placeholders for the system prompt, retrieved chunks and user's question.
# ---

In [None]:
# --- exercise
# 2. Create a HuggingFace Pipeline (this will load the model locally for use in inference)
# ---

In [None]:
# --- exercise
# 3. Chain your prompt with your Hugging Face pipeline (your model).
# If your prompt template is called `prompt` and your model `llm`, you can do this as follows:
# chain = prompt | llm
# ---

In [None]:
# --- exercise
# 4. Ask a question to your llm using the prompt providing relevant chunks from your semantic search.
# ---