Code Intelligence

Source code presents a really interesting problem for knowledge extraction. It is highly structured, yet expressive enough that it also makes sense to parse it as natural language.

While there seems to be a lot of focus right now on tools that write code, much less effort is being spent on tools that understand code. Personally, I find codebase understanding to be a much more interesting application. For a couple of years now, I've been trying to build systems that can explain whole sections of a codebase to new devs, or even explain implementation details to semi-technical stakeholders.

My latest breakthrough comes from building a knowledge graph over a codebase. In this article we'll walk through how that knowledge graph is constructed, what the graph itself looks like, and how we can leverage it to answer high-level questions about a codebase.

A Sample Codebase

The codebase we'll be using is a cigar-journal app I built in 2021 for a cigar retailer in Vancouver. It comprises 121 TypeScript files totaling 6,418 lines of code, split across a backend and a frontend.

Building a knowledge graph

Code-level constructs

In the graph we want to build, the nodes are going to be code-level constructs like classes or functions. Initially, we used tree-sitter to explicitly parse the syntax tree of a given code file (which is language-dependent!) in order to extract code-level constructs. Eventually we determined that using an LLM to extract nodes instead was far more flexible (no need to write syntax parsers for each language) and was practically just as accurate.

To determine whether two code-level constructs, Construct A and Construct B, are related, we ask the LLM whether Construct B is used by Construct A, and to describe the nature of that usage. These relationships become the edges of the graph. While "A uses B" is a directed relationship, we aren't taking advantage of the directional information for now.
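To make the node and edge vocabulary concrete, here is a minimal sketch of how a construct, a usage relationship, and the pairwise LLM question could be represented. The class names, the prompt wording, and the JSON reply shape are all assumptions made for illustration, not the pipeline used for this article; only the `file::name` identifier style and the kinds (FUNCTION, CLASS, TYPE, CONSTANT) come from the report shown later.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Construct:
    kind: str  # "FUNCTION", "CLASS", "TYPE" or "CONSTANT"
    file: str  # path of the file that defines the construct
    name: str  # identifier inside that file

    @property
    def node_id(self) -> str:
        # Same "file::name" style as the identifiers in the report.
        return f"{self.file}::{self.name}"


@dataclass(frozen=True)
class Usage:
    user: str         # node_id of Construct A (the one doing the using)
    used: str         # node_id of Construct B (the one being used)
    description: str  # the LLM's description of the usage


def usage_prompt(a: Construct, b: Construct, a_source: str) -> str:
    # Hypothetical prompt wording; the real prompt is not given in the article.
    return (
        f"Here is the source of {a.kind} {a.node_id}:\n\n{a_source}\n\n"
        f"Is {b.kind} {b.node_id} used by it? Reply with JSON: "
        '{"used": true or false, "description": "..."}'
    )


def parse_reply(a: Construct, b: Construct, reply: str) -> Optional[Usage]:
    # Assumes the model answered in the JSON shape requested above.
    answer = json.loads(reply)
    if not answer.get("used"):
        return None
    return Usage(a.node_id, b.node_id, answer.get("description", ""))
```

Keeping `user` and `used` as separate fields preserves the direction of the relationship, so it stays available even while the graph is treated as undirected.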

Concretely, here are all the code constructs (graph nodes) extracted from the file ./cj-backend/src/pipes/onCigarsOrdered.ts:

And here are the relationships to other code constructs (the edges) that the LLM identified:

- The conditionallySaveReviewPromptRow function determines if a review prompt should be saved based on the user's purchase history retrieved from UserPurchaseModel.
- The onCigarsOrdered function checks if a user exists in UserProfileModel and if they have a pending account in UserCreatePendingModel. If not, it creates a new user using getNewDefaultUser and saves the user profile.
- The onCigarsOrdered function orchestrates the creation of user accounts, saving of purchase records, and handling of review prompts based on user actions. It interacts with UserProfileModel to check for existing users, UserCreatePendingModel for pending accounts, UserPurchaseModel for purchase records, and ReviewPromptModel for managing review prompts.
- The savePurchaseRecords function saves purchase records for cigars ordered by the user using the putRow method from UserPurchaseModel.
- The saveUserPendingRow function saves a pending user record in the database using the UserCreatePendingModel.

Concretely, this is how edges are defined internally:

At this stage in the process, for every file in the codebase, we've identified:

- all the important code constructs defined in the file
- how each of those code constructs uses code constructs from other files
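As a rough sketch of how those per-file results could be merged into one undirected graph, the snippet below builds an adjacency map from (user, used, description) triples. The function name and the triple layout are assumptions, and the short node names stand in for full identifiers.

```python
from collections import defaultdict


def build_graph(usages):
    """Merge (user, used, description) triples into an undirected graph."""
    neighbors = defaultdict(set)      # node -> set of adjacent nodes
    descriptions = defaultdict(list)  # unordered node pair -> usage descriptions
    for user, used, description in usages:
        if user == used:
            continue  # ignore self-references
        # Direction is dropped here: both endpoints see each other.
        neighbors[user].add(used)
        neighbors[used].add(user)
        descriptions[frozenset((user, used))].append(description)
    return dict(neighbors), dict(descriptions)


usages = [
    ("savePurchaseRecords", "UserPurchaseModel", "saves purchase records via putRow"),
    ("conditionallySaveReviewPromptRow", "UserPurchaseModel", "reads purchase history"),
    ("saveUserPendingRow", "UserCreatePendingModel", "saves a pending user record"),
]
neighbors, descriptions = build_graph(usages)
print(sorted(neighbors["UserPurchaseModel"]))
# ['conditionallySaveReviewPromptRow', 'savePurchaseRecords']
```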

This defines a graph on all code constructs in the codebase. It's quite noisy, but you can visualize it below:

[Interactive graph of all code constructs in the codebase]

Identifying Graph 'Communities'

In graph theory, a community is a set of nodes that is more densely connected internally than with the rest of the graph. Our working hypothesis is that each community we identify in the codebase knowledge graph will correspond to a piece of the codebase that serves a well-defined purpose. We're hoping that we can use a community detection algorithm to split up the codebase programmatically in the same way we would do ourselves in order to explain it to someone else.
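One common way to put a number on "denser inside than outside" is modularity, which compares the share of edges that fall inside each community with the share expected if edges were placed at random. The sketch below computes it for an unweighted, undirected graph. It is an illustration of the idea only: the article does not say which quality function was optimized, and the toy graph is invented.

```python
def modularity(edges, community_of):
    """Modularity of a partition: sum over communities of L_c/m - (D_c/(2m))^2."""
    m = len(edges)
    internal = {}  # community -> number of edges with both ends inside it (L_c)
    degree = {}    # community -> sum of the degrees of its nodes (D_c)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Toy graph: two triangles joined by a single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores 5/14 (about 0.357), while lumping every node into one community scores 0, which matches the intuition that the triangles are the natural communities.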

Let's take a look at how the code is organized with the 'Leiden' community detection algorithm.

Using an LLM to assign a name to each identified graph community, we end up with communities named like:

- User Management Subsystem
- Execution Logging Subsystem
- API Request Handler
- ComponentPropsSubsystem
- DatabaseServiceSubsystem
- UI Rendering Subsystem
- ReviewPromptingSubsystem
- Cigar Management Subsystem

Where each community is associated with a group of code constructs:

These communities are quite remarkable: they are specific and accurate, and could really help a dev navigate the codebase. That said, we identified 73 root-level communities, which is still too many for a quick read. We'll keep grouping up the codebase until we have only a handful of sub-systems identified.

Higher-level Clustering

To continue clustering our code communities, we'll leverage the LLM-generated summaries of each community, and attempt to group them based on similar functionality. To do this, we'll compute the embedding of each community summary, then use K-Means clustering to group the most similar communities together. We can then summarize the new groups with the LLM and repeat the grouping process until we're left with a single group.
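The loop described above can be sketched as follows. This is a minimal illustration under several assumptions: `embed` and `summarize` are stand-ins for the embedding model and the LLM, the K-Means is a plain Lloyd's iteration with a naive initialization, and halving the number of groups at each level (`shrink=2`) is an arbitrary choice, since the article does not say how K is picked.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster index per vector."""
    centroids = [list(v) for v in vectors[:k]]  # naive init: the first k vectors
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels


def build_hierarchy(summaries, embed, summarize, shrink=2):
    """Group summaries level by level until a single group remains."""
    levels = [summaries]
    while len(levels[-1]) > 1:
        current = levels[-1]
        k = max(1, len(current) // shrink)  # assumed rule for choosing K
        labels = kmeans([embed(s) for s in current], k)
        groups = {}
        for summary, label in zip(current, labels):
            groups.setdefault(label, []).append(summary)
        # One new summary per group becomes a node of the next level up.
        levels.append([summarize(members) for members in groups.values()])
    return levels
```

Because K is always smaller than the number of summaries at the current level, every pass strictly reduces the number of groups, so the loop is guaranteed to end with a single root.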

Using the process defined above, we end up with the following structure over our code communities (hover over a node for the LLM-generated description):

[Interactive graph of the clustered code communities]

What can we do with all this?

This top-down organization of the codebase holds the key to modeling the concepts within it. As an illustrative use-case, we can build a report over the codebase that explains its most important features and points us to the code that implements them. The following is an auto-generated report about the cigar journal codebase. It's entirely written by the LLM, based on the knowledge graph we built:

High-level summary

The system is designed to manage a wide range of functionalities within an application, focusing on user operations, data management, and external integrations. It includes mechanisms for caching and retrieving cigar data, handling user creation and operations, and managing SMS communications. The system also supports health checks, logging, and session management to ensure smooth operation and monitoring. Additionally, it integrates with Shopify for e-commerce functionalities and provides a structured review process for cigars. The UI components facilitate user interaction, while input handling ensures accurate data entry. API routing organizes endpoint access, and the entire system is designed to operate efficiently across different environments.

Major Modules and Components

ActiveReviewManagementSubsystem

FUNCTION cj-frontend/hooks/activeReviewHooks.ts::useActiveReview
- Custom hook to access the active review state

TYPE cj-frontend/state/activeReviewState.ts::ActiveReviewState
- Interface defining the structure of ActiveReviewState

CONSTANT cj-frontend/state/activeReviewState.ts::defaultActiveReviewState
- Constant representing the default state of ActiveReviewState

RefreshTokenManager

FUNCTION cj-backend/src/service/refreshTokenMgr.ts::issueRefreshToken
- Function to issue a new refresh token for a user

FUNCTION cj-backend/src/service/refreshTokenMgr.ts::saveRefreshToken
- Function to save a refresh token for a user

FUNCTION cj-backend/src/service/refreshTokenMgr.ts::invalidate
- Function to invalidate a refresh token

RevolucionShopifyService

FUNCTION cj-backend/src/service/revolucionShopifyService.ts::loadStore
- Function to load the store data from Shopify

FUNCTION cj-backend/src/service/revolucionShopifyService.ts::loadPage
- Function to load a page of products from Shopify

FUNCTION cj-backend/src/service/revolucionShopifyService.ts::savePage
- Function to save the page data to the database

API Endpoints Subsystem

FUNCTION cj-backend/src/app.ts::app
- The main application instance

CLASS cj-backend/src/app.ts::UserEndpoints
- Class representing user-related endpoints

CLASS cj-backend/src/app.ts::WebhookEndpoints
- Class representing webhook-related endpoints

Session Management Subsystem

CLASS cj-frontend/service/backendService.ts::backendService
- An object representing the backend service with various methods for user and cigar operations

FUNCTION cj-frontend/hooks/logic/useSessionLogic.ts::useSessionLogic
- Custom hook for session logic

CONSTANT cj-frontend/service/routerService.ts::routerService
- Exported constant object containing router service methods
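A report laid out like the one above could be rendered from the graph with a simple traversal. The sketch below is a hypothetical renderer: the function name and the input shape (a mapping from subsystem name to (kind, identifier, description) triples) are assumptions, and in the real pipeline the summary and the descriptions are written by the LLM rather than passed in by hand.

```python
def render_report(summary, subsystems):
    """Render a plain-text report: a summary, then one section per subsystem."""
    lines = ["High-level summary", "", summary, "", "Major Modules and Components"]
    for name, constructs in subsystems.items():
        lines += ["", name]
        for kind, node_id, description in constructs:
            # One entry per construct: its kind and identifier, then a description.
            lines += ["", f"{kind} {node_id}", f"- {description}"]
    return "\n".join(lines)


report = render_report(
    "The system manages user operations, data management, and external integrations.",
    {
        "RefreshTokenManager": [
            ("FUNCTION",
             "cj-backend/src/service/refreshTokenMgr.ts::invalidate",
             "Function to invalidate a refresh token"),
        ],
    },
)
print(report)
```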

Until next time,

Liam