In February, Meta researchers published a paper titled Automated Unit Test Improvement using Large Language Models at Meta, which introduces a tool they called TestGen-LLM. The fully automated approach to increasing test coverage “with guaranteed assurances for improvement over the existing code base” created waves in the software engineering world.
Meta didn’t release the TestGen-LLM code, so we decided to implement it as part of our open-source Cover Agent, and we’re releasing it today!
In this post, I’ll walk through how we implemented it, share some of our findings, and outline the challenges we encountered when applying this approach to real-world codebases.
Automated Test Generation: Baseline Criteria
Automated test generation using Generative AI is nothing new. Most LLMs that are competent at generating code, such as ChatGPT, Gemini, and Code Llama, are capable of generating tests. The most common pitfall that developers run into when generating tests with LLMs is that most generated tests don’t even work and many don’t add value (e.g. they test the same functionality already covered by other tests).
To overcome this challenge (specifically, for regression unit tests) the TestGen-LLM authors came up with the following criteria:
- Does the test compile and run properly?
- Does the test increase code coverage?
Without answering these two fundamental questions, arguably, there’s no point in accepting or analyzing the generated test provided to us by the LLM.
Once we’ve validated that the tests are capable of running correctly and that they increase the coverage of our component under test, we can start to investigate (in a manual review):
- How well is the test written?
- How much value does it actually add? (We all know that code coverage can sometimes be a proxy metric, or even a vanity metric.)
- Does it meet any additional requirements that we may have?
Approach and reported results
TestGen-LLM (and Qodo Cover) run completely headless (well, kind of; we will discuss this later).
First, TestGen-LLM generates a batch of candidate tests; it then filters out those that don’t build or run, drops any that don’t pass, and finally discards those that don’t increase code coverage. In highly controlled cases, roughly one in four generated tests survives all of these steps; in real-world scenarios, Meta’s authors report closer to one in twenty.
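To make that filtering concrete, here is a minimal Python sketch of the idea. This is not Meta’s code; the `measure_coverage` callable and the file handling are hypothetical placeholders:

```python
import subprocess

def passes_filters(candidate_test: str, test_file: str, run_cmd: str,
                   coverage_before: float, measure_coverage) -> bool:
    """Keep a candidate test only if the suite still builds/runs,
    all tests pass, and overall coverage goes up."""
    original = open(test_file).read()
    try:
        # 1. Tentatively append the candidate test to the existing suite.
        with open(test_file, "w") as f:
            f.write(original + "\n" + candidate_test)

        # 2. Does the suite still build and run, and do all tests pass?
        result = subprocess.run(run_cmd, shell=True, capture_output=True)
        if result.returncode != 0:
            return False

        # 3. Does the candidate increase code coverage?
        return measure_coverage() > coverage_before
    finally:
        # Restore the original suite; the caller decides what to keep.
        with open(test_file, "w") as f:
            f.write(original)
```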
Following the automated process, Meta had a human reviewer accept or reject each test. On average, about one in two tests was accepted, with a 73% acceptance rate in their best reported cases.
It is important to note that the TestGen-LLM tool, as described in the paper, adds a single test per run to an existing test suite previously written by a developer. Moreover, it doesn’t necessarily succeed in improving every test suite it is applied to.
From the paper: “In total, over the three test-a-thons, 196 test classes were successfully improved, while the TestGen-LLM tool was applied to a total of 1,979 test classes. TestGen-LLM was therefore able to automatically improve approximately 10% of the test classes to which it was applied.”
Qodo Cover v0.1 is implemented as follows:
- Receive the following user inputs:
  - Source file for the code under test
  - Existing test suite to enhance
  - Coverage report
  - The command for building and running the test suite
  - Code coverage target and maximum number of iterations to run
  - Additional context and prompting options
- Generate more tests in the same style
- Validate those tests using your runtime environment
  - Do they build and pass?
- Ensure that the tests add value by reviewing metrics such as increased code coverage
- Update the existing test suite and coverage report
- Repeat until a stopping criterion is met: either the code coverage threshold has been reached or the maximum number of iterations has been exhausted
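To make the flow above more concrete, here is a rough Python sketch of the loop. The names `generate_tests`, `validate`, and `coverage_of` stand in for the LLM call, the build-and-run check, and the coverage measurement; they are illustrative, not Qodo Cover’s actual API:

```python
def improve_test_suite(generate_tests, validate, coverage_of,
                       desired_coverage: float, max_iterations: int):
    """Iteratively accept only the generated tests that run and raise coverage."""
    accepted = []
    current_coverage = coverage_of(accepted)
    iteration = 0

    while current_coverage < desired_coverage and iteration < max_iterations:
        iteration += 1
        for candidate in generate_tests():        # generate more tests in the same style
            if not validate(candidate):           # do they build and pass?
                continue
            new_coverage = coverage_of(accepted + [candidate])
            if new_coverage > current_coverage:   # do they add value (coverage increase)?
                accepted.append(candidate)        # update the existing test suite
                current_coverage = new_coverage

    return accepted, current_coverage
```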
Challenges we encountered when implementing and reviewing TestGen-LLM
As we worked on putting the TestGen-LLM paper into practice, we ran into some surprising challenges.
The examples presented in the paper use Kotlin for writing tests, a language that doesn’t rely on significant whitespace. In languages like Python, on the other hand, tabs and spaces are not just important but a requirement of the parser. Less sophisticated models, such as GPT-3.5, won’t consistently return properly indented code, even when explicitly prompted. One example of where this causes issues is a Python test class, which requires each test function to be indented inside it. We had to account for this throughout our development lifecycle, which added more complexity to our pre-processing logic. There is still plenty to improve in order to make Qodo Cover robust in scenarios like this.
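As a rough illustration of the kind of pre-processing this requires (not necessarily how Qodo Cover does it), a guard like the following can reject or re-indent a generated Python test before it is inserted into an indented test class:

```python
import ast
import textwrap

def normalize_candidate(test_code: str, class_indent: str = "    ") -> str | None:
    """Best-effort cleanup of an LLM-generated Python test.
    Returns None if the snippet does not even parse on its own."""
    # Strip any accidental common leading whitespace the model added.
    dedented = textwrap.dedent(test_code).strip("\n")

    # Reject candidates that are not valid Python.
    try:
        ast.parse(dedented)
    except SyntaxError:
        return None

    # Re-indent every non-empty line so the test fits inside the test class.
    return "\n".join(class_indent + line if line.strip() else line
                     for line in dedented.splitlines())
```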
After seeing the special test requirements and exceptions we encountered during our trials, we decided to give the user the ability to provide additional input or instructions to prompt the LLM as part of the Qodo Cover flow. The `--additional-instructions` option allows developers to provide any extra information that’s specific to their project, empowering them to customize Qodo Cover. These instructions can be used, for example, to steer Qodo Cover to create a rich set of tests with meaningful edge cases.
In line with the general trend of Retrieval-Augmented Generation (RAG) becoming more pervasive in AI-based applications, we found that supplying more context alongside unit test generation yields higher-quality tests and a higher passing rate. We’ve provided the `--included-files` option for users who want to manually add additional libraries or text-based design documents as context for the LLM to enhance the test generation process.
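For illustration, extra context of this kind might be folded into the prompt roughly like this (a hypothetical sketch; the function and section names are not Qodo Cover’s actual internals):

```python
from pathlib import Path

def build_prompt_context(included_files: list[str], additional_instructions: str) -> str:
    """Stitch user-supplied files and instructions into extra prompt context."""
    sections = []
    for path in included_files:
        # Inline each referenced file (helper libraries, design docs, etc.).
        sections.append(f"## File: {path}\n{Path(path).read_text()}")
    if additional_instructions:
        sections.append(f"## Additional instructions\n{additional_instructions}")
    return "\n\n".join(sections)
```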
Complex code that required multiple iterations presented another challenge. As failed (or non-value-adding) tests accumulated, we noticed a pattern where the same rejected tests were repeatedly suggested in later iterations. To combat this, we added a “Failed Tests” section to the prompt to feed that information back to the LLM and ensure it generated unique tests, never repeating tests we had deemed unusable (i.e., broken or not increasing coverage).
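A minimal sketch of that feedback loop, assuming a hypothetical `base_prompt` and a list of previously rejected tests, might look like this:

```python
def add_failed_tests_section(base_prompt: str, failed_tests: list[str]) -> str:
    """Feed previously rejected tests back to the LLM so it does not repeat them."""
    if not failed_tests:
        return base_prompt
    failed_block = "\n\n".join(failed_tests)
    return (
        f"{base_prompt}\n\n"
        "## Failed Tests\n"
        "These tests were already generated but rejected (they failed to run "
        "or did not increase coverage). Do not repeat them; propose different tests:\n"
        f"{failed_block}"
    )
```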
Another challenge that came up throughout this process was the inability to add library imports when extending an existing test suite. Developers can sometimes be myopic in their test generation process, relying on a single testing framework or approach. Beyond the many different mocking frameworks, other libraries can also help with achieving test coverage. Since the TestGen-LLM approach (and Qodo Cover) is intended to extend existing test suites, completely restructuring the whole test class is out of scope. This is, in my opinion, a limitation of test extension versus test generation, and something we plan on addressing in future iterations.
It’s important to make the distinction that in TestGen-LLM’s approach, each test requires a manual review from the developer before the next test is suggested. In Qodo Cover, on the other hand, we generate, validate, and propose as many tests as possible until the coverage requirement is achieved (or the maximum number of iterations is reached), without requiring manual intervention throughout the process. We leverage AI to run in the background, creating an unobtrusive approach to automated test generation that allows the developer to review the entire proposed test suite once the process has completed.
Conclusion and what’s next
While many, myself included, are excited about the TestGen-LLM paper and tool, in this post we have also shared its limitations. I believe that we are still in the era of AI assistants, not AI teammates who run fully automated workflows.
At the same time, well-engineered flows, which we plan to develop and share here in Qodo Cover, can help us developers automatically generate test candidates and increase code coverage in a fraction of the time.
We intend to continue developing and integrating cutting-edge methods related to the test generation domain into the Qodo Cover open-source repo.
We encourage anyone interested in generative AI for testing to collaborate and help extend the capabilities of Cover Agent, and we hope to inspire researchers to leverage this open-source tool to explore new test-generation techniques.
In the open-source Qodo Cover repo on GitHub we’ve added a development roadmap. We would love to see you contributing to the repo according to the roadmap or according to your own ideas!
Our vision for Qodo Cover is that, in the future, it will run automatically for every pull request (pre- or post-merge) and suggest regression test enhancements that have been validated to work and to increase code coverage. We envision Qodo Cover automatically scanning your codebase and opening PRs with test suites for you.
Let’s leverage AI to help us deal more efficiently with the tasks we don’t like doing!
P.S.
- We are still looking for a good benchmark for tools like this. Do you know of one? We think it is critical for further development and research.
- Check out our Qodo Flow work for (a) further reading on “Flow Engineering”, (b) an example of a competitive programming benchmark, and (c) a well-designed dataset called CodeContests.