Focus solely on evaluation metrics and benchmark details, free of model-specific distractions. Easily access and compare comprehensive tool performance insights.
SOFTWARE ENGINEERING
REAL GITHUB ISSUES
CODE TESTING
STATEFUL
INTERACTIVE
TOOL DEPENDENCIES
MULTI-ENVIRONMENT
DECISION MAKING
REASONING
AGENT RELIABILITY
REAL-WORLD SCENARIOS
MULTI-TURN DIALOGUE
TOOL CALLING
REAL-WORLD APIS
MULTI-TURN
AST EVALUATION