OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
https://github.com/agentscope-ai/OpenJudge.git
🌟 If you find OpenJudge helpful, please give us a Star! 🌟
Website | Try Online | Documentation | Contributing | 中文
OpenJudge is an open-source evaluation framework for AI applications (e.g., AI agents or chatbots) designed to evaluate quality and drive continuous application optimization.
In practice, application excellence depends on a trustworthy evaluation workflow: Collect test data → Define graders → Run evaluation at scale → Analyze weaknesses → Iterate quickly.
OpenJudge provides ready-to-use graders and supports generating scenario-specific rubrics (as graders), making this workflow simpler, more rigorous, and easier to integrate into your own pipeline. It can also convert grading results into reward signals to help you fine-tune and optimize your application.
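For intuition on the reward-signal side: a common way to turn a rubric score into a training reward is min-max normalization onto [0, 1]. The helper below is an illustrative sketch of that conversion, not OpenJudge's actual API:

```python
def score_to_reward(score: float, min_score: float = 1.0, max_score: float = 5.0) -> float:
    """Min-max normalize a grader score onto [0, 1] (illustrative sketch, not the library API)."""
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    return (score - min_score) / (max_score - min_score)

# A score of 4 on a 1-5 rubric becomes a reward of 0.75.
print(score_to_reward(4))  # 0.75
```

The bounds here default to a 1-5 rubric; match them to whatever scale your grader actually uses.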
🚀 Try it now! Visit openjudge.me/app to use graders online, no installation required. Test built-in graders, build custom rubrics, and explore evaluation results directly in your browser.
Alternatively, run the UI locally:

```shell
streamlit run ui/app.py
```

Access 50+ production-ready graders organized in a comprehensive taxonomy and rigorously validated for reliable performance.
- 🎯 **General**: semantic quality, functional correctness, structural compliance
- 🤖 **Agent**: agent lifecycle, tool calling, memory, plan feasibility, trajectory quality
- 🖼️ **Multimodal**: image-text coherence, visual generation quality, image helpfulness
Using a mainstream observability platform like LangSmith or Langfuse? We offer seamless integrations that enhance their evaluators and automated evaluation capabilities. We also integrate with training frameworks such as VERL for RL training. See Integrations for details.
Explore OpenJudge without writing a single line of code. The online platform at openjudge.me/app lets you test built-in graders, build custom rubrics, and explore evaluation results directly in your browser, with no setup needed.
```shell
pip install py-openjudge
```
💡 More installation methods can be found in the Quickstart Guide.
The complete quickstart walkthrough can be found in the Quickstart Guide.
A simple example to evaluate a single response:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader


async def main():
    # 1️⃣ Create the model client
    model = OpenAIChatModel(model="qwen3-32b")

    # 2️⃣ Initialize the grader
    grader = RelevanceGrader(model=model)

    # 3️⃣ Prepare the data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of AI that enables computers to learn from data.",
    }

    # 4️⃣ Evaluate
    result = await grader.aevaluate(**data)
    print(f"Score: {result.score}")  # Score: 4
    print(f"Reason: {result.reason}")


if __name__ == "__main__":
    asyncio.run(main())
```
Use multiple built-in graders to evaluate your LLM application comprehensively: Explore all built-in graders
Business Scenario: Evaluating an e-commerce customer service agent that handles order inquiries. We assess the agent's performance across three dimensions: relevance, hallucination, and tool selection.
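In the runner configuration that follows, each grader entry carries a `mapper` that selects and renames dataset fields into the grader's expected keyword arguments, so one dataset record can feed graders with different signatures. Conceptually it behaves like this pure-Python sketch (an illustration of the idea, not OpenJudge's internal implementation):

```python
def apply_mapper(record: dict, mapper: dict) -> dict:
    """Project a dataset record onto grader kwargs, given {grader_kwarg: dataset_field}."""
    return {kwarg: record[field] for kwarg, field in mapper.items()}

record = {
    "query": "Where is my order?",
    "response": "It ships tomorrow.",
    "context": "Order ships tomorrow.",
}
# A relevance-style grader needs only query and response, so its mapper drops context.
kwargs = apply_mapper(record, {"query": "query", "response": "response"})
print(kwargs)  # {'query': 'Where is my order?', 'response': 'It ships tomorrow.'}
```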
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common import RelevanceGrader, HallucinationGrader
from openjudge.graders.agent.tool.tool_selection import ToolSelectionGrader
from openjudge.runner import GradingRunner
from openjudge.runner.aggregator import WeightedSumAggregator
from openjudge.analyzer.statistical import DistributionAnalyzer

TOOL_DEFINITIONS = [
    {"name": "query_order", "description": "Query order status and logistics information", "parameters": {"order_id": "str"}},
    {"name": "query_logistics", "description": "Query detailed logistics tracking", "parameters": {"order_id": "str"}},
    {"name": "estimate_delivery", "description": "Estimate delivery time", "parameters": {"order_id": "str"}},
]

# Prepare your dataset
dataset = [
    {
        "query": "Where is my order ORD123456?",
        "response": "Your order ORD123456 has arrived at the Beijing distribution center and is expected to arrive tomorrow.",
        "context": "Order ORD123456: Arrived at Beijing distribution center, expected to arrive tomorrow.",
        "tool_definitions": TOOL_DEFINITIONS,
        "tool_calls": [{"name": "query_order", "arguments": {"order_id": "ORD123456"}}],
    },
    # ... more test cases
]


async def main():
    # 1️⃣ Initialize the judge model
    model = OpenAIChatModel(model="qwen3-max")

    # 2️⃣ Configure multiple graders
    grader_configs = {
        "relevance": {"grader": RelevanceGrader(model=model), "mapper": {"query": "query", "response": "response"}},
        "hallucination": {"grader": HallucinationGrader(model=model), "mapper": {"query": "query", "response": "response", "context": "context"}},
        "tool_selection": {"grader": ToolSelectionGrader(model=model), "mapper": {"query": "query", "tool_definitions": "tool_definitions", "tool_calls": "tool_calls"}},
    }

    # 3️⃣ Set up an aggregator for the overall score
    aggregator = WeightedSumAggregator(name="overall_score", weights={"relevance": 0.3, "hallucination": 0.4, "tool_selection": 0.3})

    # 4️⃣ Run the evaluation
    results = await GradingRunner(grader_configs=grader_configs, aggregators=[aggregator], max_concurrency=5).arun(dataset)

    # 5️⃣ Generate the evaluation report
    overall_stats = DistributionAnalyzer().analyze(dataset, results["overall_score"])
    print(f"{'Overall Score':<20} | {overall_stats.mean:>15.2f}")


if __name__ == "__main__":
    asyncio.run(main())
```
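The `WeightedSumAggregator` in the example above combines per-grader scores into one overall score. The arithmetic it performs is just a weighted sum, sketched here in plain Python (illustrative only, not the library's implementation, which works on full result objects):

```python
def weighted_sum(scores: dict, weights: dict) -> float:
    """Combine per-grader scores into one overall score using fixed weights."""
    return sum(weights[name] * scores[name] for name in weights)

# The same weights as the aggregator configured above.
scores = {"relevance": 4, "hallucination": 5, "tool_selection": 3}
weights = {"relevance": 0.3, "hallucination": 0.4, "tool_selection": 0.3}
print(round(weighted_sum(scores, weights), 6))  # 4.1
```

If the weights sum to 1, the overall score stays on the same scale as the individual grader scores.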
Generate a custom grader from a task description without labeled data: Zero-shot Rubrics Generation Guide
When to use: Quick prototyping when you have no labeled data but can clearly describe your task.
```python
import asyncio

from openjudge.generator.simple_rubric import SimpleRubricsGenerator, SimpleRubricsGeneratorConfig
from openjudge.models import OpenAIChatModel


async def main():
    # 1️⃣ Configure the generator
    config = SimpleRubricsGeneratorConfig(
        grader_name="customer_service_grader",
        model=OpenAIChatModel(model="qwen3-max"),
        task_description=(
            "E-commerce AI customer service primarily handles order inquiry tasks "
            "(such as logistics status and ETA) while focusing on managing customer emotions."
        ),
        min_score=1,
        max_score=3,
    )

    # 2️⃣ Generate the grader
    generator = SimpleRubricsGenerator(config)
    grader = await generator.generate(dataset=[], sample_queries=[])

    # 3️⃣ View the generated rubrics
    print("Generated Rubrics:", grader.kwargs.get("rubrics"))

    # 4️⃣ Use the grader
    result = await grader.aevaluate(
        query="My order is delayed, what should I do?",
        response="I understand your concern. Let me check your order status...",
    )
    print(f"\nScore: {result.score}/3\nReason: {result.reason}")


if __name__ == "__main__":
    asyncio.run(main())
```
Learn evaluation criteria from labeled examples: Data-driven Rubrics Generation Guide
When to use: You have labeled data and need high-accuracy graders for production use, especially when evaluation criteria are implicit.
```python
import asyncio

from openjudge.generator.iterative_rubric.generator import IterativeRubricsGenerator, IterativePointwiseRubricsGeneratorConfig
from openjudge.models import OpenAIChatModel
from openjudge.models.schema.prompt_template import LanguageEnum

# Prepare a labeled dataset (simplified example; 10+ samples are recommended in practice)
labeled_dataset = [
    {"query": "My order hasn't arrived after 10 days, I want to complain!", "response": "I sincerely apologize for the delay. I completely understand your frustration! Your order was delayed due to weather conditions, but it has now resumed shipping and is expected to arrive tomorrow. I've marked it for priority delivery.", "label_score": 5},
    {"query": "Where is my package? I need it urgently!", "response": "I understand your urgency! Your package is currently out for delivery and is expected to arrive before 2 PM today. The delivery driver's contact number is 138xxxx.", "label_score": 5},
    {"query": "Why hasn't my order arrived yet? I've been waiting for days!", "response": "Your order is expected to arrive the day after tomorrow.", "label_score": 2},
    {"query": "The logistics hasn't updated in 3 days, is it lost?", "response": "Hello, your package is not lost. It's still in transit, please wait patiently.", "label_score": 3},
    # ... more labeled examples
]


async def main():
    # 1️⃣ Configure the generator
    config = IterativePointwiseRubricsGeneratorConfig(
        grader_name="customer_service_grader_v2",
        model=OpenAIChatModel(model="qwen3-max"),
        min_score=1,
        max_score=5,
        enable_categorization=True,  # cluster learned criteria
        categories_number=5,  # aggregate them into 5 themes
    )

    # 2️⃣ Generate a grader from the labeled data
    generator = IterativeRubricsGenerator(config)
    grader = await generator.generate(labeled_dataset)

    # 3️⃣ View the learned rubrics
    print("\nLearned Rubrics from Labeled Data:\n", grader.kwargs.get("rubrics", "No rubrics generated"))

    # 4️⃣ Evaluate new samples
    test_cases = [
        {"query": "My order hasn't moved in 5 days, can you check? I'm a bit worried", "response": "I understand your concern! Let me check immediately: Your package is currently at XX distribution center. Due to recent high order volume, there's a slight delay, but it's expected to arrive the day after tomorrow. I'll proactively contact you if there are any issues."},
        {"query": "Why is this delivery so slow? I'm waiting to use it!", "response": "Checking, please wait."},
    ]
    print("\n" + "=" * 70, "\nEvaluation Results:\n", "=" * 70)
    for i, case in enumerate(test_cases):
        result = await grader.aevaluate(query=case["query"], response=case["response"])
        print(f"\n[Test {i+1}]\n  Query: {case['query']}\n  Response: {case['response']}\n  Score: {result.score}/5\n  Reason: {result.reason[:200]}...")


if __name__ == "__main__":
    asyncio.run(main())
```
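The labeled dataset above scores responses on a 1-5 scale, and before generating rubrics it can help to sanity-check the label distribution (a skewed set of labels yields skewed rubrics). A plain-Python sketch of that kind of summary, over the four example labels, illustrative and not an OpenJudge API:

```python
from statistics import mean

def summarize(scores: list) -> dict:
    """Basic score-distribution summary: mean, min, max, and counts per score."""
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "counts": {s: scores.count(s) for s in sorted(set(scores))},
    }

# Label scores from the labeled_dataset above: 5, 5, 2, 3
print(summarize([5, 5, 2, 3]))  # {'mean': 3.75, 'min': 2, 'max': 5, 'counts': {2: 1, 3: 1, 5: 2}}
```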
Seamlessly connect OpenJudge with mainstream observability and training platforms:
| Category | Platform | Status | Documentation |
|---|---|---|---|
| Observability | LangSmith | ✅ Available | LangSmith Integration Guide |
| Observability | Langfuse | ✅ Available | Langfuse Integration Guide |
| Observability | Other frameworks | 🔵 Planned | ❌ |
| Training | verl | ✅ Available | VERL Integration Guide |
| Training | Trinity-RFT | 🔵 Planned | ❌ |
💬 Have a framework you'd like us to prioritize? Open an Issue!
We love your input! We want to make contributing to OpenJudge as easy and transparent as possible.
🎨 Adding New Graders: Have domain-specific evaluation logic? Share it with the community!
🐛 Reporting Bugs: Found a glitch? Help us fix it by opening an issue.
📝 Improving Docs: Clearer explanations and better examples are always welcome.
💡 Proposing Features: Have ideas for new integrations? Let's discuss!
See the full Contributing Guidelines for coding standards and the PR process.
Join our DingTalk group to connect with the community:
OpenJudge was previously distributed as the legacy package `rm-gallery` (v0.1.x). Starting from v0.2.0, it is published as `py-openjudge`, and the Python import namespace is `openjudge`.
OpenJudge v0.2.0 is NOT backward compatible with v0.1.x. If you are currently using v0.1.x, you can stay on the legacy package:

```shell
pip install rm-gallery
```

We preserved the source code of v0.1.7 (the latest v0.1.x release) in the v0.1.7-legacy branch.
If you use OpenJudge in your research, please cite:
```bibtex
@software{openjudge2025,
  title  = {OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards},
  author = {{The OpenJudge Team}},
  url    = {https://github.com/agentscope-ai/OpenJudge},
  month  = {7},
  year   = {2025}
}
```
Made with ❤️ by the OpenJudge Team
Website · Try Online · ⭐ Star Us · 🐛 Report Bug · 💡 Request Feature