ChatGPT vs Human: The Duel for the Best Programmer Title
For decades, programmers have been writing code for artificial intelligence (AI) models, and now AI is being used to write code itself. A study published in the June issue of IEEE Transactions on Software Engineering evaluated the performance of OpenAI’s ChatGPT (based on the GPT-3.5 model) as a code generator in terms of functionality, complexity, and security.
The results show that ChatGPT’s success rate in generating functional code ranges from 0.66% to 89%, depending on the task’s difficulty, the programming language, and other factors. While ChatGPT can sometimes produce code that outperforms human solutions, the analysis also highlights security issues in AI-generated code.
The research, led by Yutian Tang, a lecturer at the University of Glasgow, found that AI-based code generation can boost productivity and automate software development tasks. However, it’s important to understand the strengths and weaknesses of these models. Tang’s team tested ChatGPT’s ability to solve 728 problems on the LeetCode platform across five programming languages: C, C++, Java, JavaScript, and Python.
Overall, ChatGPT performed well, especially on problems that existed before 2021. For easy, medium, and hard problems, its success rates were about 89%, 71%, and 40%, respectively. However, for problems introduced after 2021, ChatGPT’s ability to generate correct code dropped significantly: from 89% to 52% for easy problems, and from 40% to just 0.66% for hard ones.
This is because ChatGPT was trained on data collected up to 2021 and hasn’t encountered newer problems or their solutions. It lacks human critical thinking and tends to succeed only on problems resembling those it has already seen.
Additionally, ChatGPT can generate code that runs faster and uses less memory than at least 50% of human solutions to the same LeetCode problems. The researchers also examined ChatGPT’s ability to fix its own mistakes after receiving feedback from LeetCode. In 50 randomly selected cases where ChatGPT initially generated incorrect code, it was good at fixing compilation errors but far less successful at correcting logical errors.
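The gap between the two error types is easy to see: a compilation or syntax error comes with a message pointing at the exact line, while “wrong answer” feedback from a judge like LeetCode only names a failing input, not the cause. A minimal sketch in Python (a hypothetical illustration, not a test case from the study):

```python
# A syntax error is mechanical to fix -- the interpreter reports the line:
#
#   def add(a, b)          # missing colon -> SyntaxError at this line
#       return a + b
#
# A logical error is harder: the code below runs without any error message,
# yet silently computes the wrong thing.

def max_pair_sum(nums):
    """Intended: return the sum of the two LARGEST values in nums."""
    nums = sorted(nums)           # ascending order
    return nums[0] + nums[1]      # bug: these are the two SMALLEST values


def max_pair_sum_fixed(nums):
    """Corrected version."""
    nums = sorted(nums)
    return nums[-1] + nums[-2]    # the two largest values


print(max_pair_sum([1, 2, 3]))        # wrong answer: 3
print(max_pair_sum_fixed([1, 2, 3]))  # correct answer: 5
```

Spotting that the buggy version picks the wrong end of the sorted list requires reasoning about intent, which is exactly where the study found ChatGPT’s self-repair to be weakest.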
The study also found that ChatGPT-generated code contained vulnerabilities, such as missing null checks, though many of these issues are easy to fix. Among the languages tested, ChatGPT’s C code was the most complex, followed by C++ and Python; overall, its complexity was comparable to that of human-written code.
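A missing null check is typically a one-line oversight with an equally short fix. A hypothetical Python sketch of the pattern (the Python analogue of a null check is a guard against None; this is not code from the study):

```python
def first_word(text):
    # Vulnerable pattern: if text is None, calling .split() raises
    # AttributeError and crashes the program.
    return text.split()[0]


def first_word_safe(text):
    # The fix is usually a single guard clause before the dereference.
    if not text:                  # handles both None and the empty string
        return ""
    return text.split()[0]


print(first_word_safe("hello world"))  # "hello"
print(first_word_safe(None))           # "" instead of a crash
```

As the study notes, such defects are easy to repair once pointed out, which is why feedback to the model matters.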
Yutian Tang notes that to improve ChatGPT’s performance, developers should provide additional information and point out potential vulnerabilities so the AI can better understand tasks and avoid mistakes.
In summary, despite significant progress in using AI for code generation, human oversight and input remain crucial for creating safe and functional software.