Writing a code search tool

Scenario: you need to add a parameter to the code that sends email so new data shows up in the emails, BUT you don't remember exactly where that function is.

You could do a full text search for email:

(screenshot: full-text search results for "email")

Not helpful.

It'd be great if you could literally search "code that sends email" and get results of exactly that.

Solution: translate code into natural language. We can convert the translation into embeddings to enable natural language search across a codebase.

First, we'll need the code. Let's write a function to search our project directory and load all the code into memory:

import fs from 'fs'
import path from 'path'

type FileInfo = {
    content: string
    path: string
    filename: string
}

function findTsFiles(directory: string, exclude_dirs: string[]): FileInfo[] {
    const fileList: FileInfo[] = []

    function traverseDirectory(currentDir: string) {
        const files = fs.readdirSync(currentDir)

        files.forEach((file) => {
            const filePath = path.join(currentDir, file)
            const stats = fs.statSync(filePath)

            if (stats.isDirectory()) {
                if (!exclude_dirs.includes(file)) {
                    traverseDirectory(filePath)
                }
            } else if (stats.isFile() && (file.endsWith('.ts') || file.endsWith('.tsx'))) {
                const fileInfo: FileInfo = {
                    content: fs.readFileSync(filePath, 'utf-8'),
                    path: currentDir,
                    filename: file
                }
                fileList.push(fileInfo)
            }
        })
    }

    traverseDirectory(directory)

    // save to a json file so we don't have to do this on every run
    const jsonContent = JSON.stringify(fileList, null, 2)
    fs.writeFileSync('files.json', jsonContent)

    return fileList
}

I have a Next.js project, so I'm looking just for .ts and .tsx files.

We'll want to break the files up into functions so that search results point to specific pieces of code rather than whole files. Since we're working with JavaScript/TypeScript files, we can use Babel's parser to extract the functions:

import * as parser from '@babel/parser';
import traverse from '@babel/traverse';
import * as types from '@babel/types';

type ExtractFunctionResponse = {
    function: string;
    functionName: string;
};

export function extractFunctions(fileContents: string): ExtractFunctionResponse[] {
    const functions: ExtractFunctionResponse[] = [];

    try {
        const ast = parser.parse(fileContents, {
            sourceType: 'module',
            plugins: ['typescript', 'jsx'],
            allowImportExportEverywhere: true,
            allowAwaitOutsideFunction: true,
            allowReturnOutsideFunction: true,
            allowSuperOutsideMethod: true,
            allowUndeclaredExports: true,
            errorRecovery: true,
        });

        traverse(ast, {
            FunctionDeclaration(path) {
                const node = path.node;
                if (node.id) {
                    functions.push({
                        function: fileContents.slice(node.start, node.end),
                        functionName: node.id.name,
                    });
                }
            },
            VariableDeclarator(path) {
                const node = path.node;
                if (node.init && types.isArrowFunctionExpression(node.init) && types.isIdentifier(node.id)) {
                    functions.push({
                        function: fileContents.slice(node.start, node.end),
                        functionName: node.id.name,
                    });
                }
            },
            ExportNamedDeclaration(path) {
                const node = path.node;
                if (node.declaration) {
                    if (types.isVariableDeclaration(node.declaration)) {
                        node.declaration.declarations.forEach((declarator) => {
                            if (
                                declarator.init &&
                                types.isArrowFunctionExpression(declarator.init) &&
                                types.isIdentifier(declarator.id)
                            ) {
                                functions.push({
                                    function: fileContents.slice(declarator.start, declarator.end),
                                    functionName: declarator.id.name,
                                });
                            }
                        });
                    } else if (types.isFunctionDeclaration(node.declaration) && node.declaration.id) {
                        functions.push({
                            function: fileContents.slice(
                                node.declaration.start,
                                node.declaration.end
                            ),
                            functionName: node.declaration.id.name,
                        });
                    }
                } else if (node.specifiers && node.specifiers.length > 0) {
                    node.specifiers.forEach((specifier) => {
                        if (types.isExportSpecifier(specifier) && types.isIdentifier(specifier.exported)) {
                            const exportedName = specifier.exported.name;
                            const binding = path.scope.getBinding(exportedName);
                            if (binding && binding.path.isVariableDeclarator()) {
                                const declaratorNode = binding.path.node;
                                if (
                                    declaratorNode.init &&
                                    types.isArrowFunctionExpression(declaratorNode.init)
                                ) {
                                    functions.push({
                                        function: fileContents.slice(
                                            declaratorNode.start,
                                            declaratorNode.end
                                        ),
                                        functionName: exportedName,
                                    });
                                }
                            }
                        }
                    });
                }
            },
        });
    } catch (error) {
        console.error('Error parsing file:', error);
    }

    // remove duplicates with the same function name
    const uniqueFunctions = functions.filter((func, index, self) =>
        index === self.findIndex((f) => f.functionName === func.functionName)
    );

    return uniqueFunctions;
}

There are many ways to structure chunks of code in JavaScript, but this covers the general style of my particular codebase. If you use classes or other structures, use an LLM to update it to parse your own code style.

Since we want to use queries like `code that sends email`, we'll need to convert each of these functions to natural language. We can do that by using an LLM to generate a description for each function. This is probably the most important step: it determines how good the search results will be.

import ollama from 'ollama'

const getFunctionDescription = async (file: FileInfo, functionName: string): Promise<string | null> => {
    const prompt = `given the following file ${file.path}/${file.filename}:
\`\`\`typescript
${file.content}
\`\`\`

Provide a concise description for the function \`${functionName}\`.
The description should briefly explain the purpose of and the context around the function and any notable aspects or behavior.
Aim for a description that is around 2-3 sentences long.
Respond ONLY with the description.`

    // wrap in a try/catch in case the request fails
    try {
        const response = await ollama.chat({
            model: 'phi3:medium',
            messages: [{
                role: 'user',
                content: prompt
            }]
        })

        return response.message.content
    } catch (e) {
        console.error(e)
        return null
    }
}

I opted to include the whole file and instructed the LLM to write a description for the named function, to provide a little more context about it. There's lots of tweaking (i.e. prompt engineering) you can do here, and again, the quality of the description is probably the most important part of this process, so make sure your prompt/model combination yields really good results.

I'm using ollama to provide a local server that hosts open-source models (and ollama-js for easy access to it), but feel free to use your favorite model. Right now, I think the Claude 3 Opus model would do best with this use-case and prompt combination.

Let's use the functions so far to collect our data:

// new type for functions in the files
type FunctionInfo = {
    function: string
    functionName: string
    description: string
}

// update `FileInfo` to have function information
type FileInfo = {
    content: string
    path: string
    filename: string
    functions: FunctionInfo[]
}

// get the files
const files = findTsFiles('src', [])

const filesPromise = files.map(async (file) => {
  // extract the functions
  const funcs = extractFunctions(file.content)

  // generate a description for each function
  const withDescriptionsPromises = funcs.map(async (func) => ({
    ...func,
    description: await getFunctionDescription(file, func.functionName)
  }))

  // unwrap promises from async `.map`
  const withDescriptions = await Promise.all(withDescriptionsPromises)

  // return file data with new functions data
  return {
    ...file,
    functions: withDescriptions
  }
})

// unwrap the promises with `Promise.all`
const filesWithFunctionDescriptions = await Promise.all(filesPromise)

Let's write the function we'll use to turn the descriptions into embeddings:

export const createEmbedding = async (text: string) => {
    const resp = await ollama.embeddings({
        model: 'mxbai-embed-large',
        prompt: text
    })

    return resp.embedding
}

I'm using ollama to access the [mxbai-embed-large](https://ollama.com/library/mxbai-embed-large) embedding model. I'm not super familiar with how to choose the best embedding model, but it's the best one ollama has to offer and it worked well enough for my use-case. Feel free to use a different one.

Let's use the new function:

// add `embedding` field to `FunctionInfo`
type FunctionInfo = {
    function: string
    functionName: string
    description: string
    embedding: number[]
}

const withEmbeddings = await Promise.all(filesWithFunctionDescriptions.map(async file => ({
  ...file,
  functions: await Promise.all(file.functions.map(async func => ({
    ...func,
    embedding: await createEmbedding(func.description)
  })))
})))

Now we're ready to do the code search! Let's define a function to calculate the similarity between a query and the embedded descriptions. We'll use cosine similarity to do the search:

import * as math from 'mathjs'

export function cosineSimilarity(embedding1: number[], embedding2: number[]): number {
    const dotProduct = math.dot(embedding1, embedding2)
    const magnitude1 = math.norm(embedding1) as number
    const magnitude2 = math.norm(embedding2) as number
    return dotProduct / (magnitude1 * magnitude2)
}

We're using mathjs to do the vector calculations.
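If you'd rather skip the mathjs dependency, the same formula is easy to hand-roll. A drop-in sketch:

```typescript
// Plain cosine similarity: dot(a, b) / (|a| * |b|), no external dependencies.
export function cosineSimilarityPlain(a: number[], b: number[]): number {
    let dot = 0
    let normA = 0
    let normB = 0
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

For the vector sizes involved here (one embedding per function), the performance difference is negligible either way.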

Let's create a function to perform a search query:

type QueryResult = {
  score: number,
  func: FunctionInfo
}

export async function runQuery(
    query: string,
    files: FileInfo[]
): Promise<QueryResult[]> {
    // convert input query to embedding
    const queryEmbedding = await createEmbedding(query);

    const similarityScores: QueryResult[] = [];

    for (const file of files) {
        for (const func of file.functions) {
            // calculate similarity score
            const similarity = cosineSimilarity(queryEmbedding, func.embedding);
            similarityScores.push({
              score: similarity,
              func
            })
        }
    }

    // sort results by score
    const sorted = similarityScores.sort((a, b) => b.score - a.score);

    // return the top 10
    return sorted.slice(0, 10)
}

Boom! 💥 We can use runQuery to run natural language queries against our codebase:

const q = "code that sends email"
const results = await runQuery(q, withEmbeddings)

Now, when you're searching for some code, you can whip out the runQuery function for a better-than-full-text search. The procedure was surprisingly straightforward, too!
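To make the output easier to scan, a small formatting helper can sit on top of `runQuery`. A sketch (the `formatResults` name is illustrative, not from the code above):

```typescript
// Illustrative helper: render ranked results as "rank. name (score): description".
export function formatResults(
    results: { score: number; func: { functionName: string; description: string } }[]
): string {
    return results
        .map((r, i) => `${i + 1}. ${r.func.functionName} (${r.score.toFixed(3)}): ${r.func.description}`)
        .join('\n')
}
```

Something like `console.log(formatResults(results))` then gives you a readable ranked list.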

…but how do you know the results are good? What about completeness: what if it didn't catch all the relevant results? It could be the case that the actual best match isn't in the top 10 results.
You could definitely take a manual look at the results and spot-check different queries against a codebase you're familiar with, but is there a better way to measure whether the results are accurate and complete?

The next article will discuss a method to measure the quality of our search system.

Catch you on the next one!
