Relying on AI for Coding: Professionals Face Job Risks and Reliability Concerns
Researchers from the University of California, San Diego have investigated how reliably large language models (LLMs) answer Java coding questions drawn from the Q&A site StackOverflow. Their study found that the quality of the AI-generated answers is far from perfect.
In a preprint titled “A Study on the Robustness and Reliability of Large Language Model Code Generation”, graduate students Li Zhong and Zhilong Wang analyzed 1,208 coding questions involving 24 popular Java APIs. They then used their API-checking tool, RobustAPI, to evaluate the answers produced by four code-capable LLMs.
RobustAPI is designed to assess code reliability. Its central assumption is that deviating from an API's usage rules can cause problems once the code runs in a production environment. The researchers noted that testing of code, whether that code was written by humans or by machines, often focuses only on semantic correctness and overlooks potentially unexpected inputs.
As an example, they pointed to a code snippet whose file operations should have been wrapped in a try-catch block to handle I/O errors:
RandomAccessFile raf = new RandomAccessFile("/tmp/file.json", "r");
byte[] buffer = new byte[1024 * 1024];
int bytesRead = raf.read(buffer, 0, buffer.length);
raf.close();
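A more robust variant, shown here as a minimal sketch rather than as code from the paper, would catch the IOException and guarantee that the file is closed even if reading fails, for example with Java's try-with-resources construct:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SafeFileRead {
    public static void main(String[] args) {
        // try-with-resources closes the RandomAccessFile automatically,
        // even when read() throws an IOException
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/file.json", "r")) {
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead = raf.read(buffer, 0, buffer.length);
            System.out.println("Read " + bytesRead + " bytes");
        } catch (IOException e) {
            // a handled error instead of an uncaught exception at runtime
            System.err.println("Reading /tmp/file.json failed: " + e.getMessage());
        }
    }
}

The class name SafeFileRead is only scaffolding for this illustration; the point is that the API's exception-handling rule is respected.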
The team tested GPT-3.5 and GPT-4 from OpenAI, as well as two open-source models: Llama 2 from Meta and Vicuna-1.5 from the Large Model Systems Organization. They used three different test scenarios for their question set:
- Zero-shot: No example of correct API usage was provided in the prompt.
- One-shot-irrelevant: An unrelated example was provided.
- One-shot-relevant: A correct example of API usage was included in the prompt.
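To make the one-shot-relevant setting more concrete: besides the question itself, the prompt contains a short demonstration of correct API usage. The following exemplar is a hypothetical illustration, not taken from the study, and assumes the question concerns java.io.FileInputStream; it shows the kind of try-catch-finally pattern such a demonstration would convey:

import java.io.FileInputStream;
import java.io.IOException;

public class OneShotExemplar {
    public static void main(String[] args) {
        FileInputStream in = null;
        try {
            in = new FileInputStream("/tmp/input.bin");
            int firstByte = in.read(); // read() may throw an IOException
            System.out.println("First byte: " + firstByte);
        } catch (IOException e) {
            System.err.println("I/O error: " + e.getMessage());
        } finally {
            // close the stream even if opening or reading failed
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    // nothing sensible left to do if closing also fails
                }
            }
        }
    }
}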
Overall, the models showed the following API misuse rates in the zero-shot test:
- GPT-3.5: 49.83%
- GPT-4: 62.09%
- Llama 2: 0.66%
- Vicuna-1.5: 16.97%
“According to our evaluation, all tested models suffer from API misuse issues, even advanced commercial models like GPT-3.5 and GPT-4,” Zhong and Wang noted. “In the zero-shot setting, most of Llama’s answers didn’t include code at all, which explains why its error rate is the lowest.”
In contrast, they said, GPT-4 had a higher API misuse rate than GPT-3.5 because, as OpenAI claims, it is more likely to respond with code. The problem is that GPT-4’s answers are not necessarily correct.
For the one-shot-irrelevant test, the error rates were:
- GPT-3.5: 62%
- GPT-4: 64.34%
- Llama 2: 49.17%
- Vicuna-1.5: 48.51%
“Llama’s API misuse rate increased significantly after adding an irrelevant example, as more valid answers included code snippets,” the researchers said. “Overall, adding an irrelevant example encourages LLMs to generate more valid answers, which allows for better assessment of code reliability and robustness.”
When the models were given a correct example of API usage in the one-shot-relevant test, the results were:
- GPT-3.5: 31.13%
- GPT-4: 49.17%
- Llama 2: 47.02%
- Vicuna-1.5: 27.32%
This indicates that including a correct usage example in the prompt reduces API misuse, although it does not eliminate it.
However, the researchers conclude that there is still much work to be done, as simply generating code is not enough. The code must also be reliable, which remains a significant challenge.
“Although the code generation capabilities of large language models have improved significantly, the reliability and robustness of code in real-world production environments remain problematic. There is tremendous room for improvement in this area,” Zhong and Wang emphasized.