Aider's refactoring benchmark exercises based on popular python repos
https://github.com/Aider-AI/refactor-benchmark.git
This repository holds the exercises for a coding benchmark used by the aider AI coding tool. The benchmark was designed to provoke "lazy coding", a problem widely reported in the GPT-4 Turbo models.
This benchmark assisted in the design and evaluation of a solution to the lazy coding problem: asking GPT-4 Turbo to format code changes as unified diffs reduced lazy coding by 3X.
Aider has long used a benchmark suite based on 133 Exercism python exercises. But these are mostly small coding problems, usually requiring only a few dozen lines of code. GPT-4 Turbo is typically only lazy on 2-3 of these exercises: the ones with the most code, which involve refactoring.
Based on this observation, I set out to build a benchmark around refactoring
a non-trivial amount of code found in fairly large files.
To do this, I used python's ast module to analyze
9 popular open source python repositories
to identify challenging refactoring tasks.
The goal was to find class methods which don't use their `self` parameter, so they can be trivially refactored out of the class.

Each benchmark task asks GPT to perform a refactor like this one:

> Refactor the `_set_csrf_cookie` method in the `CsrfViewMiddleware` class to be a stand alone, top level function.
> Name the new function `_set_csrf_cookie`, exactly the same name as the existing method.
> Update any existing `self._set_csrf_cookie` calls to work with the new `_set_csrf_cookie` function.
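
As a rough sketch (not the actual benchmark script), the kind of `ast`-based scan described above could look something like this; the node-count threshold is an illustrative guess at what counts as "non-trivial":

```python
import ast
import sys

MIN_AST_NODES = 100  # illustrative threshold for a "non-trivial" method


def hoistable_methods(source: str):
    """Yield (class_name, method_name) for methods that take `self`
    but never use it, and are big enough to be interesting."""
    tree = ast.parse(source)
    for cls in ast.walk(tree):
        if not isinstance(cls, ast.ClassDef):
            continue
        for item in cls.body:
            if not isinstance(item, ast.FunctionDef):
                continue
            args = item.args.args
            if not args or args[0].arg != "self":
                continue  # skip staticmethods and odd signatures
            if sum(1 for _ in ast.walk(item)) < MIN_AST_NODES:
                continue  # too small to provoke lazy coding
            uses_self = any(
                isinstance(node, ast.Name) and node.id == "self"
                for node in ast.walk(item)
            )
            if not uses_self:
                yield cls.name, item.name


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for cls_name, method in hoistable_methods(f.read()):
            print(f"{cls_name}.{method}")
```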
A simple python AST scanning script found 89 suitable files and packaged them up as benchmark tasks. Each task has a test that checks whether the refactor was performed roughly correctly.
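
A check along these lines is straightforward with the `ast` module. This is again a sketch, with a made-up tolerance, not the benchmark's actual test code:

```python
import ast


def refactor_looks_correct(new_source: str, func_name: str,
                           original_node_count: int) -> bool:
    """Roughly validate a refactor: the file must still parse, the target
    must now exist as a top-level function, and that function must be
    about as large as the original method (so code wasn't elided and
    replaced with comments)."""
    try:
        tree = ast.parse(new_source)  # detect misapplied edits / invalid python
    except SyntaxError:
        return False

    # The method must now be a top-level function, not a class method.
    matches = [
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == func_name
    ]
    if not matches:
        return False

    # Compare AST node counts; the 10% tolerance is an arbitrary choice.
    new_count = sum(1 for _ in ast.walk(matches[0]))
    return new_count >= 0.9 * original_node_count
```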
The result is a pragmatic benchmark suite that provokes, detects and quantifies GPT coding laziness.
The refactoring exercises are based on code from the following repositories: