SolidityBench by IQ has recently launched as the first leaderboard designed to evaluate Large Language Models (LLMs) on Solidity code generation. Available on Hugging Face, it introduces two benchmarks, NaïveJudge and HumanEval for Solidity, tailored specifically to assess and rank how well AI models generate smart contract code.
Developed by IQ’s BrainDAO as part of its upcoming IQ Code suite, SolidityBench is intended to refine IQ’s EVMind LLMs while benchmarking them against generalist and community-created models. IQ Code focuses on AI models specialized in generating and auditing smart contract code, addressing the growing demand for secure and efficient blockchain applications.
NaïveJudge, a key component of SolidityBench, challenges LLMs to implement smart contracts from detailed specifications derived from audited OpenZeppelin contracts, which serve as the reference standard for correctness and efficiency. The generated code is then evaluated against the reference implementation on criteria such as functional completeness, adherence to Solidity best practices, security standards, and optimization efficiency.
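To make that workflow concrete, here is a minimal sketch of how a NaïveJudge-style task might be represented and a candidate implementation requested. The file layout, dataclass fields, prompt wording, and use of the openai Python client are illustrative assumptions, not details of the actual SolidityBench harness.

```python
# Sketch of a NaïveJudge-style task: a specification derived from an audited
# OpenZeppelin contract plus the audited reference implementation to compare against.
# Paths, model name, and prompts are assumptions for illustration only.
from dataclasses import dataclass
from pathlib import Path

from openai import OpenAI  # pip install openai


@dataclass
class NaiveJudgeTask:
    name: str                # e.g. "ERC20" (hypothetical task name)
    specification: str       # natural-language spec derived from the audited contract
    reference_solidity: str  # audited OpenZeppelin implementation used for comparison


def load_task(task_dir: Path) -> NaiveJudgeTask:
    """Load one task from a directory containing spec.md and reference.sol (assumed layout)."""
    return NaiveJudgeTask(
        name=task_dir.name,
        specification=(task_dir / "spec.md").read_text(),
        reference_solidity=(task_dir / "reference.sol").read_text(),
    )


def generate_candidate(client: OpenAI, task: NaiveJudgeTask, model: str = "gpt-4o") -> str:
    """Ask the model under test to implement the contract from the specification alone."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a senior Solidity engineer. Return only Solidity code."},
            {"role": "user",
             "content": f"Implement a smart contract that satisfies this specification:\n\n{task.specification}"},
        ],
    )
    return response.choices[0].message.content
```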
To assess the code, advanced LLMs, including versions of OpenAI’s GPT-4 and Anthropic’s Claude 3.5 Sonnet, act as impartial code reviewers. They evaluate the code on criteria such as implementation of key functionality, handling of edge cases, error management, proper syntax, code structure, and maintainability. Optimization factors such as gas efficiency and storage management are also taken into account, yielding a comprehensive assessment across functionality, security, and efficiency.
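Continuing the sketch above, an LLM-as-judge step could look roughly like the following. The rubric wording mirrors the criteria described here, but the JSON schema, equal weighting across axes, and choice of judge model are assumptions rather than the leaderboard's actual scoring code.

```python
import json

from openai import OpenAI  # pip install openai

# Rubric text mirrors the criteria described in the article; the JSON schema
# and 0-100 scale are illustrative assumptions.
JUDGE_RUBRIC = """Score the CANDIDATE Solidity contract against the REFERENCE on a 0-100
scale for each axis:
- functionality: key functionality implemented, edge cases handled, error management
- security: Solidity best practices, security standards, proper syntax, structure, maintainability
- efficiency: gas efficiency and storage management
Respond with JSON: {"functionality": int, "security": int, "efficiency": int}"""


def judge_candidate(client: OpenAI, reference_sol: str, candidate_sol: str,
                    judge_model: str = "gpt-4o") -> dict:
    """Ask a reviewer model to score a candidate contract against the audited reference."""
    response = client.chat.completions.create(
        model=judge_model,
        response_format={"type": "json_object"},  # supported by recent OpenAI chat models
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user",
             "content": f"REFERENCE:\n{reference_sol}\n\nCANDIDATE:\n{candidate_sol}"},
        ],
    )
    scores = json.loads(response.choices[0].message.content)
    # Equal weighting of the three axes is an assumption, not the leaderboard's formula.
    scores["overall"] = round(
        sum(scores[k] for k in ("functionality", "security", "efficiency")) / 3, 2
    )
    return scores
```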
In the benchmarking results, OpenAI’s GPT-4o emerged as the top performer with an overall score of 80.05, followed closely by OpenAI’s newer reasoning models, o1-preview and o1-mini. Models from Anthropic and xAI also performed competitively, while Nvidia’s Llama-3.1-Nemotron-70B scored the lowest in the top 10.
HumanEval for Solidity, adapted from OpenAI’s original HumanEval benchmark, comprises 25 Solidity tasks of varying difficulty. Each task ships with tests compatible with Hardhat, so generated code can be compiled and tested reliably. The evaluation metrics, pass@1 and pass@3, measure whether a model succeeds on its first attempt and within three attempts, offering insight into both precision and problem-solving ability.
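pass@k here follows the estimator popularized by the original HumanEval work: generate n samples per task, count how many pass the tests, and estimate the probability that at least one of k drawn samples passes. Whether SolidityBench uses this exact unbiased form rather than a raw first-try success rate is an assumption; the sketch below shows the standard calculation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the original HumanEval paper.

    n = samples generated per task, c = samples that pass the Hardhat tests,
    k = attempt budget (1 or 3 for SolidityBench's reported metrics)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 3 completions for one task, 1 of which passes the tests.
print(pass_at_k(3, 1, 1))  # ~0.333 -> pass@1
print(pass_at_k(3, 1, 3))  # 1.0    -> pass@3
```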
By introducing these benchmarks, SolidityBench aims to advance AI-assisted smart contract development, promote best practices, and drive the continuous refinement of AI models in the blockchain ecosystem. Developers, researchers, and AI enthusiasts are encouraged to explore and contribute to SolidityBench on Hugging Face to further enhance the capabilities of AI models in smart contract development.
Visit the SolidityBench leaderboard on Hugging Face to learn more and begin benchmarking Solidity generation models.