AI Judges AI: Tested Platform Deploys Four-Model Jury to Rate Itself

Tested, a recently launched platform, is upending traditional AI evaluation by replacing human judges with a panel of four frontier models: Anthropic's Claude, OpenAI's GPT, Google's Gemini, and xAI's Grok. Each model independently scores a submitted AI tool across dimensions like logical rigor, creativity, instruction following, and safety, then the platform aggregates the scores into a composite rating. The system can evaluate hundreds of tools in hours, a task that would take human experts weeks. Proponents argue this dramatically lowers the cost of quality assurance and enables continuous, real-time benchmarking. However, critics point to a fundamental flaw: if all four models share similar training data, reinforcement learning biases, or blind spots, their consensus may represent a collective hallucination rather than objective truth. The deeper concern is that AI systems, by defining the criteria for 'good AI,' could inadvertently penalize genuinely novel approaches that deviate from the statistical patterns they were trained on. Tested's approach mirrors the industry's broader shift toward automated, scalable evaluation—seen in efforts like OpenAI's Evals framework and Anthropic's constitutional AI—but it also forces a reckoning with the limits of self-assessment. This article dissects the technical architecture of Tested, examines the risks of model collusion, and offers a forward-looking verdict on whether AI can truly police itself.

Technical Deep Dive

Tested operates on a multi-agent evaluation architecture. When a developer submits a model or tool for review, the platform spawns four independent evaluation agents, each powered by a different frontier model: Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and Grok-2. Each agent receives a standardized prompt suite consisting of 50 test cases covering categories such as multi-step reasoning, factual recall, creative generation, instruction adherence, and safety boundary compliance. The agents output scores on a 0–100 scale for each category, along with a free-text justification.

The aggregation layer uses a weighted median rather than a mean to reduce the influence of outlier scores. If any model's score deviates by more than 2 standard deviations from the median, the platform flags it for human review. The entire pipeline runs on a serverless backend with a target latency of under 5 minutes per submission.
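The aggregation and flagging steps can be sketched as follows. This is a minimal illustration, not Tested's actual code: the function names, the equal judge weights, and the choice to measure the standard deviation across the four scores of a single submission are all assumptions, since the platform's weights and reference distribution are not specified.

```python
from statistics import pstdev


def weighted_median(scores, weights):
    """Weighted median; averages the two middle scores when the weight splits exactly in half."""
    if not scores or len(scores) != len(weights) or min(weights) <= 0:
        raise ValueError("need equal-length, non-empty scores and positive weights")
    pairs = sorted(zip(scores, weights))
    half = sum(weights) / 2
    running = 0.0
    for i, (score, weight) in enumerate(pairs):
        running += weight
        if running > half:
            return float(score)
        if running == half:
            # Exactly half the weight lies at or below this score.
            return (score + pairs[i + 1][0]) / 2


def flag_outliers(scores_by_judge, threshold=2.0):
    """Return judges whose score is more than `threshold` standard deviations from the median."""
    values = list(scores_by_judge.values())
    center = weighted_median(values, [1.0] * len(values))  # assumption: equal weights
    spread = pstdev(values)  # assumption: population std dev over this submission's scores
    if spread == 0:
        return []
    return [j for j, s in scores_by_judge.items() if abs(s - center) / spread > threshold]


# Hypothetical scores: three judges agree, one is far off.
print(flag_outliers({"judge_a": 80, "judge_b": 80, "judge_c": 80, "judge_d": 40}))  # prints ['judge_d']
```

One consequence of these assumptions is worth noting: with only four scores, a single judge rarely lands more than 2 standard deviations from the median unless the other three nearly agree. For scores of 92, 78, 80, and 58, the lowest score sits only about 1.7 standard deviations from the median of 79, so a production rule that flags such a case would have to compute the deviation differently, for example against historical score spreads.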

A key engineering challenge is prompt contamination: if the evaluation prompts leak into the training data of future model versions, the entire benchmark becomes invalid. Tested mitigates this by rotating prompt sets from a cryptographically signed pool of 10,000 pre-generated test cases, refreshed weekly. However, this approach cannot fully prevent models from recognizing evaluation patterns—a known issue in the LLM benchmarking community.
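A rotation scheme of this kind might look like the sketch below. Everything here is an assumption made for illustration: the key, the function names, the use of an HMAC as a stand-in for whatever signature scheme Tested uses, and the week-based seed. The suite size of 50 and the pool size of 10,000 come from the figures above.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical signing key


def sign_case(case_text):
    """Sign a test case so later tampering can be detected."""
    return hmac.new(SECRET_KEY, case_text.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_case(case_text, signature):
    """Constant-time check that a test case still matches its signature."""
    return hmac.compare_digest(sign_case(case_text), signature)


def select_weekly_suite(pool, iso_year, iso_week, suite_size=50):
    """Pick a reproducible suite for a given week and verify every case in it."""
    rng = random.Random(f"{iso_year}-W{iso_week:02d}")  # assumption: seed derived from the week
    chosen = rng.sample(pool, suite_size)
    for text, signature in chosen:
        if not verify_case(text, signature):
            raise ValueError("test case failed signature check")
    return chosen


# Placeholder pool of 10,000 signed cases; real cases would be full prompts.
pool = [(text, sign_case(text)) for text in (f"test case {i}" for i in range(10_000))]
suite = select_weekly_suite(pool, 2025, 1)
print(len(suite))  # prints 50
```

A seed derived only from the calendar week is predictable, so a real deployment would presumably mix in a secret. Signing also protects only the integrity of the pool; it does nothing to stop a model from recognizing the style of evaluation prompts.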

Data Table: Tested Evaluation Architecture vs. Traditional Human Evaluation

| Feature | Tested (AI Jury) | Traditional Human Evaluation |
|---|---|---|
| Evaluation time per tool | 5 minutes | 2–4 hours |
| Cost per evaluation | ~$0.50 (API costs) | $200–$500 (expert labor) |
| Number of dimensions scored | 6 (logic, creativity, safety, etc.) | 3–5 (varies by panel) |
| Inter-rater consistency | 0.82 (Cohen's kappa) | 0.65–0.75 (human agreement) |
| Scalability | 10,000+ tools/month | 50–100 tools/month |
| Vulnerability to bias | High (shared training data) | Medium (individual human bias) |

Data Takeaway: By the table's own figures, Tested offers a 24x to 48x speedup and a 400x to 1,000x cost reduction over human evaluation, with higher inter-rater consistency. But the bias vulnerability is a critical trade-off: the platform trades human subjectivity for algorithmic homogeneity.
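The ratios can be recomputed directly from the table rows. A minimal sketch, assuming each human-evaluation range is a simple low-to-high bound and treating "10,000+" as a floor:

```python
# Figures taken from the comparison table above.
ai_minutes = 5
human_minutes = (2 * 60, 4 * 60)   # 2-4 hours
ai_cost = 0.50
human_cost = (200.0, 500.0)
ai_tools_per_month = 10_000        # stated as "10,000+", so a lower bound
human_tools_per_month = (50, 100)

speedup = tuple(m / ai_minutes for m in human_minutes)
cost_ratio = tuple(c / ai_cost for c in human_cost)
throughput_ratio = tuple(ai_tools_per_month / t for t in reversed(human_tools_per_month))

print(speedup)           # prints (24.0, 48.0)
print(cost_ratio)        # prints (400.0, 1000.0)
print(throughput_ratio)  # prints (100.0, 200.0)
```

The throughput row adds a third ratio the takeaway does not mention: at least 100x to 200x more tools evaluated per month.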

Key Players & Case Studies

Tested was developed by a team of former Google DeepMind researchers who have chosen to remain anonymous. The platform has already evaluated over 200 AI tools, including open-source models like Meta's Llama 3.1 70B, Mistral's Mixtral 8x22B, and Alibaba's Qwen2.5-72B, as well as proprietary APIs from Cohere, AI21 Labs, and Reka.

A notable case study: when Tested evaluated a fine-tuned version of Llama 3.1 designed for legal reasoning, the jury gave it a composite score of 79/100. However, Claude's agent scored it 92, while Grok's agent scored it 58, a 34-point spread. The platform's automated flagging system triggered a human review, which revealed that Grok's training data had a documented underrepresentation of legal case law, causing it to penalize domain-specific terminology. This incident highlights both the value of multi-model evaluation (catching blind spots) and its peril (models can be confidently wrong).

Data Table: Model-Specific Scoring Variance on Tested (Sample of 10 Tools)

| Tool Evaluated | Claude Score | GPT-4o Score | Gemini Score | Grok Score | Composite | Spread |
|---|---|---|---|---|---|---|
| Llama 3.1 70B | 82 | 79 | 85 | 74 | 80 | 11 |
| Mistral Large 2 | 88 | 91 | 86 | 79 | 87 | 12 |
| Qwen2.5-72B | 76 | 80 | 73 | 68 | 75 | 12 |
| Cohere Command R+ | 70 | 73 | 75 | 65 | 71 | 10 |
| AI21 Jamba 1.5 | 84 | 82 | 80 | 77 | 81 | 7 |
| Reka Core | 79 | 76 | 81 | 72 | 78 | 9 |
| Custom Legal LLM | 92 | 78 | 80 | 58 | 79 | 34 |
| Creative Writing Model | 85 | 90 | 82 | 88 | 86 | 8 |
| Safety-Tuned Model | 95 | 92 | 93 | 89 | 92 | 6 |
| Math Reasoning Model | 77 | 85 | 79 | 81 | 80 | 8 |

Data Takeaway: The average spread across models is 11.7 points, but domain-specific tools (like the legal LLM) can see spreads exceeding 30 points. This suggests that the jury's consensus is fragile when evaluating specialized systems, and the composite score may mask significant disagreement.
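The spread figures can be reproduced from the table. A minimal sketch, taking spread to mean the highest judge score minus the lowest (an assumption, though it matches every row):

```python
# Scores from the table, in the order (Claude, GPT-4o, Gemini, Grok).
scores = {
    "Llama 3.1 70B": (82, 79, 85, 74),
    "Mistral Large 2": (88, 91, 86, 79),
    "Qwen2.5-72B": (76, 80, 73, 68),
    "Cohere Command R+": (70, 73, 75, 65),
    "AI21 Jamba 1.5": (84, 82, 80, 77),
    "Reka Core": (79, 76, 81, 72),
    "Custom Legal LLM": (92, 78, 80, 58),
    "Creative Writing Model": (85, 90, 82, 88),
    "Safety-Tuned Model": (95, 92, 93, 89),
    "Math Reasoning Model": (77, 85, 79, 81),
}

spreads = {tool: max(s) - min(s) for tool, s in scores.items()}
average_spread = sum(spreads.values()) / len(spreads)

# Average with the single largest spread removed, to show how much one tool moves it.
largest = max(spreads, key=spreads.get)
trimmed = [v for tool, v in spreads.items() if tool != largest]
average_without_largest = sum(trimmed) / len(trimmed)

print(round(average_spread, 1))           # prints 11.7
print(largest, spreads[largest])          # prints Custom Legal LLM 34
print(round(average_without_largest, 1))  # prints 9.2
```

Dropping the legal model alone lowers the average spread from 11.7 to about 9.2, so the headline average is driven heavily by one specialized tool.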

Industry Impact & Market Dynamics

Tested arrives at a moment when the AI evaluation market is projected to grow from $1.2 billion in 2024 to $4.8 billion by 2028, a fourfold increase that implies a compound annual growth rate of roughly 41%. The dominant benchmarks (HumanEval, MMLU, and BIG-bench) are static and suffer from data leakage and saturation. Tested's dynamic, multi-model approach could disrupt this landscape by offering a continuous evaluation service that adapts as new models emerge.
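The growth rate implied by the two endpoints can be checked directly. A minimal sketch, assuming both figures are annual market sizes and that growth compounds over the four years from 2024 to 2028:

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


implied_rate = cagr(1.2, 4.8, 2028 - 2024)
size_at_32_percent = 1.2 * (1 + 0.32) ** 4

print(f"{implied_rate:.1%}")         # prints 41.4%
print(f"{size_at_32_percent:.2f}")   # prints 3.64
```

Growing from $1.2 billion to $4.8 billion in four years requires about 41% a year. For comparison, a 32% annual rate from the same starting point would reach only about $3.6 billion by 2028.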

However, the platform faces a chicken-and-egg problem: to be credible, it must be used by the very companies whose models it evaluates. OpenAI, Anthropic, Google, and xAI have not officially endorsed Tested, and there is a non-trivial risk that they could block API access to their models for evaluation purposes. Tested currently uses API keys provided by developers, but if a model provider decides to throttle or deny requests from known evaluation IPs, the platform's jury would be crippled.

Another market dynamic is the rise of 'evaluation-as-a-service' startups. Tested competes with platforms like Patronus AI (which focuses on safety evaluation) and Galileo (which offers LLM observability). Tested's differentiation is its explicit use of a multi-model jury, but this also makes it more expensive to operate—each evaluation costs $0.50 in API fees, compared to $0.10 for a single-model evaluation.

Data Table: AI Evaluation Market Landscape

| Platform | Approach | Cost per eval | Models supported | Key customer |
|---|---|---|---|---|
| Tested | Multi-model jury (4 models) | $0.50 | Any (via API) | Developers |
| Patronus AI | Single-model safety eval | $0.10 | GPT-4o, Claude | Enterprise |
| Galileo | Observability + eval | $0.05 | Custom | RAG systems |
| HumanEval | Static benchmark | Free | N/A | Research |
| MMLU | Static benchmark | Free | N/A | Research |

Data Takeaway: Tested is the most expensive per-evaluation option, but its multi-model approach could justify the premium if it provides more robust and trustworthy scores. The key question is whether the market values 'trustworthiness' enough to pay 5x to 10x more than the paid alternatives.

Risks, Limitations & Open Questions

The most significant risk is collective hallucination. If all four jury models share a common blind spot—for example, a tendency to favor verbose outputs over concise ones—the composite score will systematically penalize terse models. This is not hypothetical: research from the University of Washington (2024) showed that when GPT-4, Claude, and Gemini evaluated each other's outputs, they consistently preferred longer, more elaborate responses, even when the shorter response was factually superior. Tested's weighted median approach mitigates extreme outliers but does not address systematic bias.
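The difference between an outlier and a shared bias can be shown with a toy model. In this sketch every number is hypothetical: each judge's score is a true quality plus that judge's own noise plus a bias common to all judges (for instance, a penalty for terse answers), and an unweighted median stands in for the weighted one.

```python
from statistics import mean, median


def judge_scores(true_quality, judge_noise, shared_bias):
    """Each judge reports the true quality plus its own noise plus a bias shared by all judges."""
    return [true_quality + noise + shared_bias for noise in judge_noise]


noise = [-3, 1, 2, -30]  # hypothetical: three judges near the truth, one far outlier

unbiased = judge_scores(80, noise, shared_bias=0)    # [77, 81, 82, 50]
biased = judge_scores(80, noise, shared_bias=-10)    # [67, 71, 72, 40]

print(median(unbiased), mean(unbiased))  # prints 79.0 72.5
print(median(biased))                    # prints 69.0
```

The median shrugs off the lone outlier (79 against a true quality of 80, where the mean falls to 72.5), but the shared 10-point penalty passes straight through to the composite, which drops to 69.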

A second risk is adversarial gaming. A developer could fine-tune their model to match the scoring patterns of the jury models, effectively 'cheating' on the evaluation. Tested's rotating prompt pool makes this harder, but not impossible—a determined actor could reverse-engineer the scoring function using a proxy model.

Third, there is the philosophical problem of self-reference. When AI judges AI, the system becomes a closed loop. There is no external ground truth—no human expert to say 'this evaluation is wrong.' This mirrors the 'who watches the watchmen?' problem in governance. Tested attempts to address this by flagging high-variance scores for human review, but humans are only brought in when the system detects an anomaly. For the vast majority of evaluations, the AI jury's verdict is final.

Finally, there is the open question of model evolution. As Claude, GPT, Gemini, and Grok receive updates, their scoring behavior will change. A model that scores 80 today might score 70 next month, not because the evaluated tool changed, but because the judge's standards shifted. Tested would need to version-control its jury models and recalibrate scores retroactively—a non-trivial engineering challenge.
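One way to make that drift visible is to pin the jury's model versions to every stored score. The sketch below is an assumption about how such records might look, not a description of Tested's data model; the class, the function, and the "hypothetical-next-version" label are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    tool: str
    composite: float
    jury: tuple  # ((judge, version), ...) pinned at scoring time


def stale_evaluations(history, current_jury):
    """Return evaluations scored by a jury that differs from the current one."""
    return [e for e in history if e.jury != current_jury]


JURY_V1 = (
    ("Claude", "Claude 3.5 Sonnet"),
    ("GPT", "GPT-4o"),
    ("Gemini", "Gemini 1.5 Pro"),
    ("Grok", "Grok-2"),
)
# Hypothetical later jury in which one judge has been updated.
JURY_V2 = JURY_V1[:3] + (("Grok", "hypothetical-next-version"),)

history = [
    Evaluation("Tool A", 80.0, JURY_V1),
    Evaluation("Tool B", 70.0, JURY_V2),
]
to_rescore = stale_evaluations(history, JURY_V2)
print([e.tool for e in to_rescore])  # prints ['Tool A']
```

With the jury recorded per score, two composites are only compared when their juries match, and anything scored under an older jury can be queued for re-evaluation rather than silently mixed in.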

AINews Verdict & Predictions

Tested is a brilliant experiment in automated evaluation, but it is not ready to replace human judgment. The platform's greatest value lies in its ability to surface disagreements between models, flagging tools that perform inconsistently across different AI judges. This 'disagreement detection' is a genuinely useful signal that no single-model benchmark can provide.

Our predictions:
1. Within 12 months, Tested will be acquired by a major cloud provider (likely Google or AWS) seeking to offer evaluation as a bundled service for their AI customers. The technology is too valuable to remain independent.
2. Within 24 months, the concept of a 'jury of models' will become standard practice for enterprise AI procurement, with companies requiring multi-model evaluation reports before deploying any third-party model.
3. The biggest risk is that Tested's success will accelerate a 'race to the middle,' where model developers optimize for the average preference of the jury, producing increasingly homogeneous and safe-but-bland models. True innovation—which often looks like a bug to existing judges—will be systematically penalized.

What to watch: The next version of Tested should include a 'dissenting opinion' feature that highlights the minority model's reasoning. If a platform can make the case for why a low-scoring model might actually be innovative, it would transform evaluation from a gatekeeping function into a discovery engine. Until then, Tested is a powerful tool with a dangerous blind spot: it cannot recognize what it cannot understand.
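A 'dissenting opinion' feature could be as simple as surfacing the judge furthest from the median along with its justification. A minimal sketch under that assumption; the function name and the placeholder justification strings are hypothetical, while the scores are those of the legal LLM case study:

```python
from statistics import median


def dissenting_opinion(verdicts):
    """verdicts maps judge -> (score, justification); return the judge furthest from the median."""
    center = median(score for score, _ in verdicts.values())
    judge = max(verdicts, key=lambda j: abs(verdicts[j][0] - center))
    score, justification = verdicts[judge]
    return {
        "judge": judge,
        "score": score,
        "gap_from_median": score - center,
        "justification": justification,
    }


legal_llm = {
    "Claude": (92, "placeholder justification"),
    "GPT-4o": (78, "placeholder justification"),
    "Gemini": (80, "placeholder justification"),
    "Grok": (58, "placeholder justification"),
}
dissent = dissenting_opinion(legal_llm)
print(dissent["judge"], dissent["gap_from_median"])  # prints Grok -21.0
```

Reporting the dissent alongside the composite would keep the minority reasoning in front of the reader even when the gap is too small to trigger a human review.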
