Authors
Partha Sarathi Samal, Suresh Kumar Palus and Sai Kiran Padmam, Independent Researchers, USA
Abstract
Autonomous software development agents represent a pivotal shift in how organizations approach coding, testing, and maintenance work. Industry trends suggest these systems will move from proof of concept toward production deployment within the next 18 to 24 months. Current evaluation frameworks remain fragmented, focusing on isolated task types or on a single metric dimension, which creates blind spots for practitioners and researchers. This paper introduces DevAgentBench, a comprehensive benchmark suite for autonomous software development agents that spans multiple phases of the software development lifecycle. DevAgentBench covers four core task families: bug fixing, test generation, refactoring, and code review assistance, plus long-horizon feature tasks that demand planning and coordination. We propose a three-layer metric framework that captures task success, operational reliability, and business-aligned performance. We also present a taxonomy of nine failure-mode categories observed in agent behavior, grounded in real-world agent deployments and existing benchmarks. Finally, we release DevAgentEval, an open-source evaluation framework that enables researchers and tool builders to assess new agents consistently. Baseline experiments across three agent patterns and multiple large language models reveal that no single agent dominates across all task families, and that certain failure modes persist regardless of model size.
Keywords
Autonomous agents, agentic AI, software engineering, autonomous software development, code generation, code repair, program repair, bug fixing, test generation, unit testing, continuous integration, continuous delivery, DevOps, MLOps, LLMOps, code review, refactoring, repository-scale evaluation, benchmarking, evaluation framework, reliability metrics, cost efficiency, failure modes, safety, security, tool use, planning and reasoning, long-horizon tasks