Executive Summary

In this article, we present how a coding agent is built on top of a large language model (LLM) and how we measure its performance on the Nutanix Cloud Platform, showcasing its ability to manage intricate code-generation workflows.

Introduction

When we compare coding large language models (LLMs) with natural language (NL) LLMs, such as CodeLlama vs. Llama3, some distinctions are readily apparent. Coding LLMs are significantly more challenging to develop and work with than NL LLMs for the following reasons.

  1. Precision and Syntax Sensitivity: Code is a formal language with strict syntax rules and structures. A minor error, such as a misplaced bracket or a missing semicolon, can lead to errors that prevent the code from functioning. This requires the LLM to have a high degree of precision and an understanding of syntactic correctness, which is generally more stringent than the flexibility seen in natural language.
  2. Execution Semantics: Code not only needs to be syntactically correct, but it also has to be semantically valid—that is, it needs to perform the function it is supposed to do. Unlike natural language, where the meaning can be implicitly interpreted and still understood even if somewhat imprecisely expressed, code execution needs to yield very specific outcomes. If a code LLM gets the semantics wrong, the program might not work at all or might perform unintended operations.
  3. Context and Dependency Management: Code often involves multiple files or modules that interact with each other, and changes in one part can affect others. Understanding and managing these dependencies and contexts is crucial for a coding LLM, which adds a layer of complexity compared to handling standalone text in natural language.
  4. Variety of Programming Languages: There are many programming languages, each with its own syntax, idioms, and usage contexts. A coding LLM needs to potentially handle multiple languages, understand their unique characteristics, and switch contexts appropriately. This is analogous to a multilingual NL LLM but often with less tolerance for error.
  5. Data Availability and Diversity: While there is a vast amount of natural language data available from books, websites, and other sources, high-quality, annotated programming data can be more limited. Code also lacks the redundancy and variability of natural languages, which can make training more difficult.
  6. Understanding the Underlying Logic: Writing effective code involves understanding algorithms and logic. This requires not only language understanding but also computational thinking, which adds an additional layer of complexity for LLMs designed to generate or interpret code.
  7. Integration and Testing Requirements: For a coding LLM, the generated code often needs to be tested to ensure it works as intended. This involves integrating with software development environments and tools, which is more complex than the generally self-contained process of generating text in natural language.

Each of these aspects makes the development and effective operation of coding LLMs a challenging task, often requiring more specialized knowledge and sophisticated techniques compared to natural language LLMs.

The deployment and life-cycle management of an LLM-serving API are challenging because of the autoregressive nature of transformer-based generation. For code LLMs, the problem is more acute for the following reasons:

  1. Real-Time Performance: In many applications, coding LLMs are expected to provide real-time assistance to developers, such as for code completion, debugging, or even generating code snippets on the fly. Meeting these performance expectations requires highly efficient models and infrastructure to minimize latency, which can be technically challenging and resource-intensive.
  2. Scalability and Resource Management: Code generation tasks can be computationally expensive, especially when handling complex codebases or generating lengthy code outputs. Efficiently scaling the service to handle multiple concurrent users without degrading performance demands sophisticated resource management and possibly significant computational resources. Also, the attention computation at inference time has quadratic time complexity with respect to the input sequence length, and input sequences for code models are often significantly longer than for NL models.
  3. Context Management: Effective code generation often requires understanding not just the immediate code snippet but also broader project contexts, such as libraries used, the overall software architecture, and even the specific project's coding standards. Maintaining and accessing this contextual information in a way that is both accurate and efficient adds complexity to the serving infrastructure.
  4. Security Concerns: Serving a coding LLM involves potential security risks, not only in terms of the security of the model itself (e.g., preventing unauthorized access) but also ensuring that the code it generates does not introduce security vulnerabilities into user projects. Ensuring both model and output security requires rigorous security measures and constant vigilance.

In summary, code LLMs are much harder to train and deploy for inference than NL LLMs. In this article, we benchmark a code generation API developed entirely on Nutanix infrastructure.

Code Generation Workflow

Figure 1: Workflow of an LLM-assisted code generation system

Figure 1 shows an LLM-assisted code generation workflow. It combines a context and a prompt through a prompt template to generate the input sequence for a large language model (LLM). The LLM then generates the output, which is passed to the evaluation system. If the output is not satisfactory, the user can revise the prompt, the prompt template, and/or the LLM used. Table 1 shows the taxonomy for the LLM-assisted code generation workflow.
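As a concrete illustration of the first step, the sketch below fills a prompt template with a prompt and a code context to produce the LLM input. The template string and function name here are illustrative assumptions, not the exact implementation used in our system.

# Illustrative sketch: combine a prompt and a code context via a template.
# The template text and function name are examples, not the production code.
PROMPT_TEMPLATE = """{prompt}

Context:
{context}

Response:"""

def build_llm_input(prompt: str, context: str) -> str:
    """Fill the prompt template with a specific prompt and code context."""
    return PROMPT_TEMPLATE.format(prompt=prompt, context=context)

llm_input = build_llm_input(
    prompt="Write unit tests for the following function.",
    context="def two_sum(nums, target): ...",
)
print(llm_input)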

Table 1: Taxonomy for the LLM-assisted code generation workflow


Term: Prompt
Description: Instruction to an LLM
Example: "Write unit tests for the following function."

Term: Context
Description: Code body on which the instruction is executed
Example:

def two_sum(nums, target):
    hash_map = {}

    for index, num in enumerate(nums):
        difference = target - num
        if difference in hash_map:
            return [hash_map[difference], index]
        hash_map[num] = index

    return None

Term: Prompt Template
Description: Template used to combine the prompt and the context
Example:

<PROMPT>

Context:
<CONTEXT>

Response:

Term: Input
Description: A combination of prompt and context through the prompt template
Example: The prompt template filled with a specific prompt and context

Term: LLM
Description: Large language model
Example: CodeLlama, StarCoder

Term: Output
Description: Output generated by the LLM
Example:

import unittest

class TestTwoSum(unittest.TestCase):

    def test_two_sum_normal(self):
        self.assertEqual(two_sum([2, 7, 11, 15], 9), [0, 1])

    def test_two_sum_no_solution(self):
        self.assertIsNone(two_sum([1, 2, 3, 4], 10))

    def test_two_sum_negative_numbers(self):
        self.assertEqual(two_sum([-3, 4, 3, 90], 0), [0, 2])

    def test_two_sum_same_element_twice(self):
        # two_sum([3, 3], 6) finds the pair at indices 0 and 1
        self.assertEqual(two_sum([3, 3], 6), [0, 1])

    def test_two_sum_one_element(self):
        self.assertIsNone(two_sum([3], 3))

    def test_two_sum_empty_list(self):
        self.assertIsNone(two_sum([], 3))

Term: Evaluation
Description: Accuracy assessment by a subject matter expert
Example: Provide feedback on the quality of the generated output and experiment with the prompt, prompt template, and/or LLM for a given context

Nutanix Cloud Platform

At Nutanix, we are dedicated to enabling customers to build and deploy intelligent applications anywhere: edge, core data centers, service provider infrastructure, and public clouds. Figure 2 shows a schematic architecture of Nutanix GPT-in-a-Box 2.0, an enterprise AI platform running on the Nutanix Cloud Platform (NCP).

Figure 2: AI stack running on the cloud-native infrastructure stack of NCP.

As shown in Figure 2, the app layer runs on top of the infrastructure layer of the Nutanix GPT-in-a-Box 2.0 system used in the testing described below. The infrastructure layer can be deployed in two steps: logging in to the Prism Element console and then configuring the VM resources. Figure 3 shows the UI for the Prism Element controller.

Figure 3: The UI showing the setup for a Prism Element console on which the transformer model for this article was trained. It shows the AHV hypervisor summary, storage summary, VM summary, hardware summary, monitoring for cluster-wide controller IOPS, monitoring for cluster-wide controller I/O bandwidth, monitoring for cluster-wide controller latency, cluster CPU usage, cluster memory usage, granular health indicators, and data resiliency status.

After logging into Prism Element, we create a virtual machine (VM) hosted on our Nutanix AHV cluster. As shown in Figure 4, the VM has the following resource configuration: Ubuntu 22.04 operating system, 16 single-core vCPUs, 64 GB of RAM, and an NVIDIA A100 Tensor Core passthrough GPU with 40 GB of memory. The GPU uses the NVIDIA RTX 15.0 driver for Ubuntu (NVIDIA-Linux-x86_64-525.60.13-grid.run). Large deep learning models with transformer architectures require GPUs or other compute accelerators with high memory bandwidth, large register files, and ample L1 memory.

Figure 4: The VM resource configuration UI pane on Nutanix Prism Element. As shown, it helps a user configure the number of vCPUs, the number of cores per vCPU, the memory size (GiB), and the GPU choice. We used an NVIDIA A100 80G for this article.

The NVIDIA A100 Tensor Core GPU is designed to power the world’s highest-performing elastic datacenters for AI, data analytics, and HPC. Powered by the NVIDIA Ampere™ architecture, A100 is the engine of the NVIDIA data center platform. A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands.

To look at the detailed features of the A100 GPU, we run the `nvidia-smi` command, a command-line utility built on top of the NVIDIA Management Library (NVML) and intended to aid in the management and monitoring of NVIDIA GPU devices. The output of the `nvidia-smi` command is shown in Figure 5. It reports a driver version of 515.86.01 and a CUDA version of 11.7, and it highlights several critical features of the A100 GPU we used. These features are described in Table 2.

 

Figure 5: Output of `nvidia-smi` for the underlying A100 GPU

Table 2: Description of the key features of the underlying A100 GPU.


Feature | Value | Description
GPU | 0 | GPU index
Name | NVIDIA A100 | GPU name
Temp | 34C | Core GPU temperature
Perf | P0 | GPU performance state
Persistence-M | On | Persistence mode
Pwr: Usage/Cap | 36W / 250W | GPU power usage and its cap
Bus-Id | 00000000:00:06.0 | domain:bus:device.function
Disp.A | Off | Display active
Memory-Usage | 25939MiB / 40960MiB | Memory allocated out of total memory
Volatile Uncorr. ECC | 0 | Counter of uncorrectable ECC memory errors
GPU-Util | 0% | GPU utilization
Compute M. | Default | Compute mode
MIG M. | Disabled | Multi-Instance GPU (MIG) mode
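The same fields reported by `nvidia-smi` can also be read programmatically through NVML. The sketch below uses the pynvml bindings to query a few of the values listed in Table 2; it is a minimal illustration rather than part of our benchmarking code.

# Minimal sketch: query A100 properties via NVML (requires the pynvml package).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # GPU index 0

name = pynvml.nvmlDeviceGetName(handle)                  # bytes or str depending on pynvml version
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)             # used/total memory in bytes
util = pynvml.nvmlDeviceGetUtilizationRates(handle)      # GPU utilization in percent
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts

print(f"{name}: {mem.used / 2**20:.0f}MiB / {mem.total / 2**20:.0f}MiB, "
      f"util {util.gpu}%, {temp}C, {power_w:.0f}W")

pynvml.nvmlShutdown()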

Benchmarking Hypothesis

We aim to study the impact of input and output token size on latency, as well as identify any memory or time bottlenecks in the workflow. It is instructive to choose the right code datasets for this benchmarking, and we chose to use code from the GitHub repositories for three popular Python packages: NumPy, PyTorch, and Seaborn. These packages were chosen because their repositories include distinct complexities that could affect the unit test generation.

  • NumPy is a package for highly optimized array operations. Its codebase includes a wide range of mathematical functions which are relatively straightforward to write unit tests for.
  • PyTorch is a popular optimized Deep Learning tensor library. Its complexity in model architectures introduces unique challenges in test generation.
  • Seaborn is a Python data visualization library. Unlike NumPy and PyTorch, Seaborn’s focus on rendering visualizations adds a layer of complexity in terms of testing image outputs.

For the code LLM API, we used Meta-Llama-3-8B-Instruct. The API server was implemented using FastAPI.
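As a rough sketch of what such a serving layer can look like, the snippet below wraps a vLLM engine hosting Meta-Llama-3-8B-Instruct behind a FastAPI endpoint. The endpoint path, request schema, and sampling parameters are illustrative assumptions, not the exact API we benchmarked.

# Illustrative sketch of a code-generation endpoint: FastAPI in front of a vLLM engine.
# Endpoint name, schema, and sampling settings are assumptions for illustration only.
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # loaded once at startup

class GenerateRequest(BaseModel):
    prompt: str   # instruction, e.g. "Write unit tests for the following function."
    context: str  # code body the instruction applies to

class GenerateResponse(BaseModel):
    output: str

@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest) -> GenerateResponse:
    # Fill the prompt template (see Table 1) and run a single generation.
    llm_input = f"{req.prompt}\n\nContext:\n{req.context}\n\nResponse:"
    params = SamplingParams(temperature=0.2, max_tokens=1024)
    result = llm.generate([llm_input], params)[0]
    return GenerateResponse(output=result.outputs[0].text)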

Results

Latency

First, we measured the latency for each request and compared it with the corresponding input/output token counts. Specifically, we measured the following metrics (a minimal measurement sketch follows the list):

  • Latency: The time elapsed from the moment the API endpoint is called to when the output is received and written to a test file.
  • Input Token Count: The number of tokens in the API call query.
  • Output Token Count: The number of tokens in the API call response.
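
A minimal way to collect these three metrics is sketched below, assuming the `/generate` endpoint from the serving sketch above and the Meta-Llama-3-8B-Instruct tokenizer; the URL and field names are illustrative.

# Sketch: measure latency plus input/output token counts for one API call.
# The endpoint URL and JSON fields are illustrative assumptions.
import time
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def benchmark_call(prompt: str, context: str,
                   url: str = "http://localhost:8000/generate"):
    payload = {"prompt": prompt, "context": context}

    start = time.perf_counter()
    output = requests.post(url, json=payload, timeout=600).json()["output"]
    latency = time.perf_counter() - start        # seconds from request to response

    input_tokens = len(tokenizer.encode(prompt + context))
    output_tokens = len(tokenizer.encode(output))
    return latency, input_tokens, output_tokens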

As expected, the latencies for all three packages closely fit an exponential distribution (p-value < 0.001). Figure 6 shows the fitted distribution, with the P99 latencies in red.
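The fit and the P99 estimate can be reproduced with scipy; the sketch below is illustrative, with placeholder samples standing in for latencies collected as above.

# Sketch: fit an exponential distribution to latency samples and report P99.
import numpy as np
from scipy import stats

latency_samples = [2.1, 3.8, 0.9, 7.4, 12.6]       # placeholder values; real samples
latencies = np.asarray(latency_samples)             # come from the benchmark runs

loc, scale = stats.expon.fit(latencies)             # maximum-likelihood fit
_, p_value = stats.kstest(latencies, "expon", args=(loc, scale))

p99 = stats.expon.ppf(0.99, loc=loc, scale=scale)   # fitted P99 latency
print(f"KS p-value: {p_value:.4f}, P99 latency: {p99:.1f}s")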

Figure 6: Latency distribution for all 3 packages. The black line shows the fitted exponential distribution, and the red line denotes P99 latency.

The P99 latency for the NumPy repo appears higher than for the Seaborn and PyTorch repos. This could be explained by the fact that the NumPy input files were on average larger, and had more functions per file, than the PyTorch and Seaborn input files.

Figure 7 shows the correlation matrix among latency, input token count, and output token count for each individual package. There is an almost perfect linear correlation between latency and output token count in all cases.

Figure 7: Correlation matrix for the different packages
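
A correlation matrix like those in Figure 7 can be computed directly from the per-request measurements; the sketch below assumes they are gathered in a pandas DataFrame with illustrative column names and placeholder rows.

# Sketch: correlation matrix of latency vs. token counts for one repository.
# Column names are illustrative; the records come from the measurement sketch above.
import pandas as pd

records = [
    # (latency_s, input_tokens, output_tokens) per API call -- placeholder rows
    (4.2, 1200, 310),
    (9.7, 2500, 720),
    (1.3, 400, 95),
]
df = pd.DataFrame(records, columns=["latency_s", "input_tokens", "output_tokens"])

print(df.corr(method="pearson"))   # pairwise Pearson correlation matrix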

Figure 8 shows the joint plot between latency and output token count for all 3 repositories. It clearly shows that latency increases with output token count. This proportionality can be explained by the fact that the LLM generates one token at a time.

Figure 8: There is an almost perfect linear correlation between latency and output token count for all 3 packages. There is no statistically significant difference between the regression lines for the 3 packages.

Interestingly, while there is a relatively high correlation between input token count and latency for the PyTorch and NumPy repos, this is not the case for the Seaborn repo. Given the heavy emphasis on visualization within the Seaborn repository, input token count may not be a good measure of input complexity for it. Instead, the complexity in unit test generation for Seaborn comes from validating image, rather than textual, output, and this complexity remains regardless of input length.

For all three packages, we notice outliers in the latency against input token count graph. Where latency is high for a low input token count, the input file tends to have a large number of utility functions with no docstrings or comments explaining their use (for example, husl.py from the Seaborn repo). Where latency is low for a high input token count, the input file tends to be mostly comments, or lists of configurations and constants that do not need to be unit tested.

Memory Usage

Next, we look at memory usage per line of code during the test generation workflow in order to find memory bottlenecks in the program. The memory_profiler module was used to log memory usage per line of code for all Python scripts in the PyTorch repository (a minimal usage sketch follows the list below). During the unit test generation workflow, four main functions are called:

  • generate_test_file
  • parse_code
  • run_main_agent
  • run_combiner_agent
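
memory_profiler reports per-line memory usage for any function decorated with its `@profile` decorator. The sketch below shows how such instrumentation might look, with a stand-in body in place of the actual workflow functions listed above.

# Sketch: per-line memory profiling with memory_profiler (pip install memory-profiler).
# run_main_agent here is a stand-in for the actual workflow functions.
from memory_profiler import profile

@profile
def run_main_agent(extracted_functions):
    generated = []
    for f in extracted_functions:
        # each generated test is kept in memory, so usage grows with the function count
        generated.append(f.upper())   # placeholder for the real generation call
    return generated

if __name__ == "__main__":
    run_main_agent(["def a(): ...", "def b(): ..."])
    # Running this script prints a line-by-line memory table for the decorated function.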

Figure 9 shows the memory usage per line of code for each of these four functions.

Figure 9: Memory usage against line number for test generation. Each line represents a file.

From these graphs, we notice some key bottlenecks. First, line 81 of run_main_agent: 

# Line 81 of run_main_agent: one LLM generation call per extracted function
for f in self.extracted_functions:
    agent.generate_direct_vllm(
        context=f, file_name=self.file_name, **kwargs
    )

The memory used here scales linearly with the number of functions extracted from the input file. As a result, files with many function definitions cause the spikes in memory usage observed. 

Similar behavior is seen in run_combiner_agent: memory usage scales linearly with the number of classes, methods, and import statements extracted from the file.

Time Complexity

To identify any timing bottlenecks, cProfile was used to profile the timing behavior of unit test generation on the PyTorch repository. The flame graph in Figure 10 shows the relative time spent in different parts of the workflow. As expected, most of the time is spent waiting for the vLLM response.
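
The timing data behind such a flame graph can be gathered with cProfile. The sketch below shows one way to profile a single test-generation run and dump the stats, with the entry-point name being an illustrative assumption; tools such as snakeviz or flameprof can then render the dumped stats visually.

# Sketch: profile one unit-test-generation run with cProfile and print the hot spots.
# generate_test_file(...) is a stand-in for the workflow's real entry point.
import cProfile
import pstats

def generate_test_file(path: str) -> None:
    ...  # placeholder for the real workflow call

profiler = cProfile.Profile()
profiler.enable()
generate_test_file("example_module.py")
profiler.disable()

profiler.dump_stats("test_generation.prof")             # can be rendered as a flame graph
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)                                    # top 20 functions by cumulative time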

Figure 10: Flame graph showing time spent in different functions. The width of each frame corresponds to the time spent in that function, and the call stack can be recreated by tracing frames upwards.

Combining this with insights from the latency benchmarking, we know that larger test files require more time to be generated.

Insights

  • The response time varies proportionally with the output token count, and memory usage varies proportionally with the number of classes, methods and import statements in the input file.
  • Response times across the tested repositories range between 0 and 20 seconds.

Conclusion

This article demonstrates how we can benchmark an LLM-based unit test writing API for different open-source repositories. The benchmarking process not only highlights the efficiency and coverage of the generated tests but also provides insights into the strengths and limitations of the LLM in diverse codebases. By systematically evaluating performance metrics such as accuracy, execution time, and test coverage across multiple repositories, we can better understand the contexts in which LLMs excel and where improvements are needed. Future work could focus on refining the model's understanding of complex logic patterns and enhancing its adaptability to various coding styles, ultimately leading to more robust and reliable unit test generation tools.

 

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). The third-party products in this article are referenced for demonstration purposes only. Nutanix is not affiliated with, endorsed by, or sponsored by these third-party companies. The use of these third party products is solely for illustrative purposes to demonstrate the features and capabilities of Nutanix's products. This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.

This post may contain express and implied forward-looking statements, which are not historical facts and are instead based on our current expectations, estimates and beliefs. The accuracy of such statements involves risks and uncertainties and depends upon future events, including those that may be beyond our control, and actual results may differ materially and adversely from those anticipated or implied by such statements. Any forward-looking statements included herein speak only as of the date hereof and, except as required by law, we assume no obligation to update or otherwise revise any of such forward-looking statements to reflect subsequent events or circumstances.