How do you measure an AI agent that lives in a multi-party conversation — a Slack channel, a Discord server, a group thread — rather than a tidy one-on-one assistant exchange? This compendium gathers original research on that question, produced in a single focused session and verified finding by finding.
It has two halves that answer two different questions, followed by a framework proposal that combines them.
The two reports
How the field measures group-chat agents today →
A literature survey of the current state of group-chat agent evaluation: the benchmarks, the datasets, and the six recurring eval mechanisms — addressee accuracy, when-to-speak F1, conversation disentanglement, simulated-user rollouts graded by state-diff, LLM-as-judge social rubrics, and information leakage.
The headline: the distinctive group-chat skills are demonstrably unsolved. Frontier models sit near chance on addressee recognition, near-random on when-to-speak timing, and the best multi-party memory system scores 46%.
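Two of the mechanisms named above, addressee accuracy and when-to-speak F1, reduce to simple counting once each decision point has a gold label. The following is a minimal sketch under assumptions: the `Turn` record and its field names are hypothetical and are not drawn from any specific benchmark in the survey.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One decision point in a group chat (hypothetical schema)."""
    gold_speak: bool                      # should the agent speak here?
    pred_speak: bool                      # did the agent choose to speak?
    gold_addressee: Optional[str] = None  # who the message was meant for
    pred_addressee: Optional[str] = None  # who the agent thinks it was for


def when_to_speak_f1(turns: List[Turn]) -> float:
    """F1 over the binary speak / stay-silent decision."""
    tp = sum(t.gold_speak and t.pred_speak for t in turns)
    fp = sum((not t.gold_speak) and t.pred_speak for t in turns)
    fn = sum(t.gold_speak and (not t.pred_speak) for t in turns)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def addressee_accuracy(turns: List[Turn]) -> float:
    """Share of labeled turns where the predicted addressee matches gold."""
    scored = [t for t in turns if t.gold_addressee is not None]
    if not scored:
        return 0.0
    correct = sum(t.pred_addressee == t.gold_addressee for t in scored)
    return correct / len(scored)


example = [
    Turn(True, True, "alice", "alice"),   # spoke when it should, right addressee
    Turn(True, False, "bob", None),       # stayed silent when it should have spoken
    Turn(False, True),                    # spoke when it should have stayed silent
    Turn(False, False),                   # correctly stayed silent
]
print(when_to_speak_f1(example), addressee_accuracy(example))
```

Turns without a gold addressee are skipped by `addressee_accuracy` rather than counted as wrong; that is a design choice of this sketch, and a real benchmark may define the denominator differently.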
What the science says we should measure →
The layer the LLM-eval literature doesn't reach. Drawing on roughly fifty years of research into how human group conversation actually works — conversation analysis, the social psychology of groups, and sociolinguistics/pragmatics — this report translates established theory into concrete evaluation criteria.
The result is twenty-three criteria, mapped against what current evals already cover. The report ends with the handful of high-value criteria that are both important and realistically measurable: graceful repair, footing and attribution integrity, common-ground maintenance, face-redress calibration, and participation equity as a norm rather than a metric.
A fresh eval framework →
A proposal that fuses the two: grade a group-chat agent's turn-decision as a cascade — attend, speak-or-stay-silent, address, ground, compose, conduct — on a substrate of planted probes, scoring restraint and interactional conduct as first-class outcomes. Includes the smallest teachable version: a single "Is this for me, and should I speak?" demo chapter.
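One possible reading of the cascade idea can be sketched in a few lines. Everything here is an assumption made for illustration: the stage identifiers, the rule that a failed stage blocks grading of later stages, and the rule that a correct decision to stay silent ends the cascade with a pass. The framework report itself may define these differently.

```python
from typing import Dict, Optional

# Stage names follow the cascade order given in the text; the identifiers
# themselves are hypothetical.
STAGES = ("attend", "speak_or_silent", "address", "ground", "compose", "conduct")


def grade_cascade(results: Dict[str, bool], should_speak: bool) -> Dict[str, Optional[bool]]:
    """Grade stages in order.

    A stage is True/False if graded, None if it was never reached.
    Assumption: a failure blocks later stages, and correct silence
    (should_speak is False and speak_or_silent passed) ends the cascade.
    """
    graded: Dict[str, Optional[bool]] = {}
    blocked = False
    for stage in STAGES:
        if blocked:
            graded[stage] = None
            continue
        passed = bool(results.get(stage, False))
        graded[stage] = passed
        if not passed:
            blocked = True
        elif stage == "speak_or_silent" and not should_speak:
            blocked = True  # restraint was the right outcome; nothing left to grade
    return graded


def first_failure(graded: Dict[str, Optional[bool]]) -> Optional[str]:
    """Return the earliest failed stage, or None if no graded stage failed."""
    for stage in STAGES:
        if graded[stage] is False:
            return stage
    return None


# A probe where staying silent is correct and the agent stays silent.
silent = grade_cascade({"attend": True, "speak_or_silent": True}, should_speak=False)
print(first_failure(silent))
```

Reporting the earliest failed stage, rather than a single blended score, keeps a wrong-addressee error distinct from a poorly composed reply, which is the point of grading the decision as a cascade.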
How to read this
Start with the survey to see how the field measures things now, then read the dynamics report to see the gap between that and what the science of conversation says competence actually requires. The first is descriptive; the second is a specification; the framework is one attempt at a buildable answer.
Provenance
Both reports were produced by multi-agent deep-research workflows with live web search and per-finding fact-checking — citations were independently verified, and several self-corrected during the run (noted inline). The verified structured findings behind each report are available as raw data:
- Eval-survey findings (JSON) — 5 sub-topics
- Dynamics findings (JSON) — 3 disciplines, 25 constructs
This is research feedstock. Nothing here is wired into a shipping product.