(Please, follow along in the code example for this article.)
GraphRAG
In theory, the concept of Retrieval Augmented Generation (RAG) is pretty general - basically, anytime we search for relevant information to a user's query and include that in our prompt, we are doing a kind of RAG.
However, in practice, a lot of emphasis is placed on how RAG systems retrieve their relevant context.
I think the most interesting and probably the most promising method of fetching context is via a knowledge graph.
It's difficult for a RAG system to answer high-level, conceptual questions well when its only source of context is the raw text of the source documents: higher-level conclusions can usually only be reached by including huge amounts of the corpus text in a single prompt.
Intuitively, when we build a knowledge graph from our documents, we are 'pre-computing' these higher-level conclusions into a well-defined structure.
Thus, I'd suggest that to facilitate higher-level question answering, you should include a knowledge graph in your RAG system.
At a high level, GraphRAG is really a combination of 3 distinct tasks:
- 1. Automated knowledge graph creation
- 2. Retrieving relevant information from the knowledge graph
- 3. Inclusion of relevant knowledge graph information into LLM text generation
Knowledge Graphs
So what is a 'knowledge graph' exactly?
Knowledge graphs are a way of modeling information using entities and relationships between entities.
In the more general language of graphs, each entity in a knowledge graph is a node, and each relationship is a labelled edge between two nodes.
Consider, for example, the statements: 'Mark Zuckerberg is the CEO of Meta' and 'Meta produces the Quest 3'.
A sensible graph structure we could apply to this information would have 3 entities (nodes) and 2 edges:
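Here's a minimal sketch of that structure using networkx (the entities and relationship labels are just my illustration):

```python
import networkx as nx

# Illustrative knowledge graph: 3 entities, 2 labelled relationships.
kg = nx.Graph()
kg.add_node("Mark Zuckerberg", entity_type="PERSON")
kg.add_node("Meta", entity_type="ORGANIZATION")
kg.add_node("Quest 3", entity_type="PRODUCT")

kg.add_edge("Mark Zuckerberg", "Meta", label="IS_CEO_OF")
kg.add_edge("Meta", "Quest 3", label="PRODUCES")
```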
What's the point of a knowledge graph?
So why bother with modeling information this way? For one, we need to enforce some structure on information in order to work with it programmatically. In addition, graphs are a deeply studied structure in mathematics, so we can leverage a huge amount of existing math to analyze our knowledge graphs.
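To give a small taste of what that buys us, standard graph algorithms apply immediately to the toy graph above:

```python
# Which entities are the most connected hubs?
print(nx.degree_centrality(kg))
# {'Mark Zuckerberg': 0.5, 'Meta': 1.0, 'Quest 3': 0.5}

# How are two entities related? Walk the shortest path between them.
print(nx.shortest_path(kg, "Mark Zuckerberg", "Quest 3"))
# ['Mark Zuckerberg', 'Meta', 'Quest 3']
```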
A GraphRAG Implementation
The approach to constructing a GraphRAG system that we'll follow is adapted from the Microsoft GraphRAG implementation. There is an accompanying paper, which we will essentially walk through to arrive at our own GraphRAG implementation (see the accompanying repository to this article). A useful knowledge graph is created with the following steps (a sketch of the whole pipeline follows the list):
- 1. Split the provided documentation into chunks that we can fit into an LLM's context window
- 2. For each chunk, extract the entities and relationships described within it into a knowledge graph for that chunk
- 3. Merge the knowledge graphs from each chunk into one, top-level knowledge graph
- 4. Partition the top-level knowledge graph into communities, each of which represents a distinct concept within the knowledge base at hand
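As a rough sketch, the whole build might read like the driver below. Every helper name here (split_into_chunks, extract_graph, merge_graphs, detect_communities) is a hypothetical stand-in for the steps walked through next, not the actual minikg API:

```python
import networkx as nx

def build_knowledge_graph(documents: list[str]) -> nx.MultiGraph:
    # 1. Split documents into chunks sized for the LLM's context window.
    chunks = [chunk for doc in documents for chunk in split_into_chunks(doc)]

    # 2. Extract a small per-chunk knowledge graph of entities and relationships.
    chunk_graphs = [extract_graph(chunk) for chunk in chunks]

    # 3. Merge all the per-chunk graphs into one top-level graph.
    master = merge_graphs(chunk_graphs)

    # 4. Partition the top-level graph into concept-level communities.
    master.graph["communities"] = detect_communities(master)
    return master
```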
Knowledge graph construction walkthrough
Chunking
Theoretically, you can probably get away with chunking the document corpus based on the size of your LLM's context window. In practice, however, I've found that large chunks can contain too much information to fit within the output limits of many models, not to mention that output quality can drop off. For that reason, chunk size is probably something you'll need to tune based on the specific corpus at hand. See the 'document split' build step and the document 'splitter' implementation.
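For illustration, a minimal token-based splitter might look like this; I'm using tiktoken for token counting, and the chunk size and overlap are arbitrary knobs to tune, not recommendations:

```python
import tiktoken

def split_document(text: str, max_tokens: int = 1200, overlap: int = 100) -> list[str]:
    """Split text into token-bounded chunks with a small overlap between them."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap  # slide forward, keeping `overlap` tokens of context
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks
```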
Entities
In the Microsoft implementation, entities and relationships are extracted simultaneously. To simplify the task for the LLM, my approach was to break this up into 2 requests - one to extract entities and a subsequent request to extract relationships between those identified entities. Something else I think is valuable to do at this stage is to specify the specific kinds of entities that we are interested in. We can even enforce this with an "enum" property in the JSON schema of our output shape. In the Meta earnings call transcripts example, the entity types I enforced are:
["ORGANIZATION",
"EVENT",
"PRODUCT",
"PERSON",
"OPPORTUNITY",
"CHALLENGE"]
See the entity / relationship extraction step and the entity extraction implementation.
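As a sketch of the idea (not the exact schema used in minikg), the output shape for entity extraction could be a JSON schema along these lines, with the entity type constrained by the enum:

```python
# Hypothetical JSON schema for structured entity extraction output.
ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "entity_type": {
                        "type": "string",
                        # The enum restricts the LLM to the entity kinds we care about.
                        "enum": ["ORGANIZATION", "EVENT", "PRODUCT",
                                 "PERSON", "OPPORTUNITY", "CHALLENGE"],
                    },
                    "description": {"type": "string"},
                },
                "required": ["name", "entity_type", "description"],
            },
        }
    },
    "required": ["entities"],
}
```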
Relationships
In this step, for the same chunk that we just extracted entities from, we'll ask the LLM to extract all the meaningful relationships between those identified entities. See the relationship extraction implementation. The GraphRAG implementation goes to great lengths to do this exhaustively: it actually performs another LLM request to check whether or not 'all relationships have been extracted', and repeatedly performs the 'relationship extraction' request until satisfied. One major difference between the approach I took and that of the Microsoft paper is that I allowed for multiple relationships between entities, whereas the Microsoft paper does not. This results in a significantly different kind of graph, called a multigraph, that allows for such multiple edges between nodes. This will become relevant when we discuss the community detection algorithm.
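That extract-then-check loop might be sketched like so; extract_llm and ask_llm are hypothetical placeholders for structured and plain LLM calls, and the retry cap is my own addition:

```python
from typing import Callable

MAX_RETRIES = 3  # arbitrary cap so we can't loop forever

def extract_relationships(
    chunk: str,
    entities: list[str],
    extract_llm: Callable[[str], list[dict]],  # hypothetical structured-output call
    ask_llm: Callable[[str], str],             # hypothetical plain-text call
) -> list[dict]:
    """Re-prompt until the LLM says every relationship has been extracted."""
    relationships = extract_llm(
        f"Extract all meaningful relationships between {entities} in:\n{chunk}"
    )
    for _ in range(MAX_RETRIES):
        verdict = ask_llm("Were ALL relationships extracted? Answer YES or NO.")
        if verdict.strip().upper().startswith("YES"):
            break
        relationships += extract_llm("Some relationships were missed. Extract the rest.")
    return relationships
```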
Merging graphs
Now we have many small knowledge graphs and would like to combine them into one 'master' graph. The approach in GraphRAG is actually very straightforward: if node A from graph G represents the same entity as node A from graph H, we create a node A' in the merged graph that has every edge from A in G and every edge from A in H.
New nodes are simply added with all their edges. There is no attempt to model multiple relationships between nodes: two entities are only ever connected by a single edge (see the edge merging implementation and the default edge merge operation). Basically, the last relationship encountered between two entities in the merging process will be the only relationship between them in the final graph.
In my implementation, because I'm allowing for multiple edges between nodes, I make an attempt to de-duplicate among the edges of a merged node using the cosine similarity of their descriptions. One other approach that would maintain a single edge between nodes is to constantly summarize (using an LLM) the relationship between two entities as we merge in different edges between them.
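Sketched with numpy, assuming each relationship carries a text description and embed is a placeholder for an embedding model (the 0.9 threshold is arbitrary):

```python
import numpy as np
from typing import Callable

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe_parallel_edges(
    descriptions: list[str],
    embed: Callable[[str], np.ndarray],  # hypothetical embedding model
    threshold: float = 0.9,
) -> list[str]:
    """Drop relationship descriptions that are near-duplicates of one already kept."""
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for desc in descriptions:
        vec = embed(desc)
        if all(cosine_similarity(vec, v) < threshold for v in kept_vecs):
            kept.append(desc)
            kept_vecs.append(vec)
    return kept
```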
Communities
For the community-detection phase, we're going to leverage some of the powerful graph-theory mathematics we mentioned earlier. The GraphRAG implementation uses an algorithm called 'Leiden community detection' to detect its communities (link). Unfortunately, the Leiden algorithm can only be run on strict graphs, not multigraphs like the one I created in my implementation. Thus, I used a different algorithm called 'Louvain community detection', which can operate on multigraphs. The Leiden algorithm was developed to address some deficiencies of Louvain, but for a proof-of-concept I elected to accept those deficiencies.
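With networkx this is pleasantly short. One caveat: to keep modularity-based detection well-behaved, this sketch collapses parallel edges into integer weights first, which is one common way to run Louvain over a multigraph; I'm not claiming it mirrors minikg line-for-line:

```python
import networkx as nx

def detect_communities(mg: nx.MultiGraph) -> list[set[str]]:
    """Collapse parallel edges into weights, then run Louvain community detection."""
    g = nx.Graph()
    for u, v, _ in mg.edges(keys=True):
        if g.has_edge(u, v):
            g[u][v]["weight"] += 1  # each parallel edge strengthens the tie
        else:
            g.add_edge(u, v, weight=1)
    # Returns a list of node sets, one per detected community.
    return nx.community.louvain_communities(g, weight="weight", seed=42)
```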
Querying a knowledge graph
The intuition here is that each community is able to offer a topic-specific response to a query, and by combining these responses, we are able to answer very high-level questions about the information in the source documents.
My approach
In outline (a code sketch follows the list):
- 1. Identify the most similar node and edge in each community (link)
- 2. Exclude communities from the response that are too dissimilar (link)
- 3. For each relevant community, generate an answer to the query with the context of relevant entity and relationship descriptions (link)
- 4. Remove responses that are deemed to be irrelevant or ungrounded in the source material (link)
- 5. Generate a final response to the query using the answers from each relevant community (link)
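Stitched together, that outline might read like the sketch below; every helper (embed, community_similarity, answer_with_community, is_grounded, combine_answers) is a hypothetical placeholder for the linked implementations:

```python
MIN_SIMILARITY = 0.3  # arbitrary relevance cutoff

def answer_query(query: str, communities: list[set[str]], graph) -> str:
    """High-level query path: answer per community, filter, then combine."""
    query_vec = embed(query)

    # 1-2. Score each community by its most query-similar node/edge,
    #      and drop the communities that are too dissimilar.
    relevant = [c for c in communities
                if community_similarity(c, graph, query_vec) >= MIN_SIMILARITY]

    # 3. Answer the query once per relevant community, with that community's
    #    entity and relationship descriptions as context.
    answers = [answer_with_community(query, c, graph) for c in relevant]

    # 4. Remove answers judged irrelevant or ungrounded in the source material.
    answers = [a for a in answers if is_grounded(a, query)]

    # 5. Combine the surviving per-community answers into a final response.
    return combine_answers(query, answers)
```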
Microsoft GraphRAG approach
The paper's implementation is very similar to the one in minikg, except that relevant communities are identified using a vector search over summarizations of each community.
The major reason I did not take this tack was because of cost - I had already spent about $20 on OpenAI credits getting my implementation to this point, and wasn't keen to spend more on summarizing each community.
Realistically, I think the 'community summarization' approach is likely superior for identifying relevant communities.
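For completeness, that retrieval strategy might look roughly like this hypothetical sketch (I didn't implement it, so summarize_community and embed are placeholders for an LLM summarizer and an embedding model):

```python
import numpy as np

def build_community_index(communities, graph, summarize_community, embed):
    """One-time (and LLM-costly) step: summarize, then embed, each community."""
    summaries = [summarize_community(community, graph) for community in communities]
    vectors = np.array([embed(summary) for summary in summaries])
    return summaries, vectors

def top_k_communities(query: str, vectors: np.ndarray, embed, k: int = 5) -> list[int]:
    """Vector search: return indices of the k most query-similar summaries."""
    q = embed(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return list(np.argsort(sims)[::-1][:k])
```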
Closing remarks
From my perspective, the concept of 'GraphRAG' is by no means a one-size-fits-all concept with a clear general-purpose implementation.
While the Microsoft implementation is certainly well thought out and usable, there is no reason to believe that all of the design choices within it are the correct ones to make, especially for your unique use case.
I think structured knowledge representation is extremely powerful in the era of LLMs, and knowledge graphs are a promising tool to enable such representations.
You're encouraged to build upon or borrow large parts of the minikg codebase in order to construct your own GraphRAG implementation, or simply to hack on automated construction of knowledge graphs.
Until next time!
Liam, lover of mangoes