Group-Chat Agent Evaluation

How Group-Chat Agent Eval Is Done Today

Literature survey — benchmarks, datasets, and the six recurring eval mechanisms

A literature survey for deciding whether group-chat agent evaluation is worth teaching.

1. TL;DR

2. Landscape by sub-topic

2.1 Benchmarks & datasets (agent-in-group-chat)

Resource | Year | What | Eval method
HSII (How Social Is It?) | 2025 | LLM as autonomous social agent, ~6.7 participants / 7.8 turns; 8,305 samples from real news via GPT-4 + human refinement | Cascading 4-stage score ι: format parse → target (addressee) selection → switch quality → multi-turn stability; a parse failure zeroes everything downstream
DICE-Bench | 2025 | First multi-round, multi-party tool-calling benchmark; 1,607 instances, 2-4 participants, 124-tool graph | Exact Match on the function call + DICE-Score measuring how dispersed tool params are across turns/speakers (higher = harder)
GroupMemBench | 2026 | Agent memory in multi-party chat (vs dyadic assumption of prior memory benchmarks) | Accuracy on group dynamics, speaker-grounded belief tracking, audience-adapted (ToM) language; best system 46.0%
MUCA + MUS | 2024 | First LLM framework for multi-user group chat (3W: What/When/Who) + paired multi-user simulator | Engagement, conversation evenness (participation balance), opinion consensus + subjective ratings, via simulated group convos
MAGPIE | 2025 | Contextual-privacy eval in multi-agent collaboration; ~200 high-stakes tasks where private info is essential | % sensitive info leaked to other agents while still completing the task (Gemini 2.5-Pro up to 50.7%, GPT-5 up to 35.1%)
MultiAgentBench | 2025 | LLM multi-agent systems, cooperative + competitive, star/tree/graph/chain topologies | Milestone-based KPIs scoring collaboration/competition quality (not just completion) across coordination protocols
Addressee & Response Selection (Hu et al.) | 2018 | Ubuntu IRC; sender/addressee/observer roles | Addressee selection (ADR) accuracy + response selection (RES) accuracy at varying participant counts
Molweni | 2020 | Multi-party MRC from Ubuntu, ~10k dialogues, 30,066 QA pairs, reply-to discourse graphs | Span MRC F1/EM (with unanswerable Qs) + discourse-dependency parsing (who-replies-to-whom)

Synthesis. Real benchmarks exist, but they split into two non-overlapping camps: agent-to-agent settings (MultiAgentBench, MAGPIE) and static-dialogue comprehension (Molweni, Ubuntu IRC). The genuinely new agent-era work (HSII, DICE-Bench, GroupMemBench) converges on the same hard core — track who said what across many speakers before acting — and reports low scores, confirming the difficulty is real. What's missing everywhere is an LLM embedded as one participant in a live human channel with interruptions and side-threads.
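
The "cascading" scoring idea listed for HSII, where an early failure zeroes everything downstream, can be sketched as a gated average. This is a minimal illustration under stated assumptions, not HSII's actual formula: the `StageResult` type, the equal stage weights, and the choice to keep credit for stages that passed before the failure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str
    score: float   # stage score in [0, 1]
    passed: bool   # hard gate: False stops the cascade here


def cascade_score(stages: list[StageResult]) -> float:
    """Average the stage scores, counting a failed stage and all later stages as 0.

    Assumption: equal weights per stage (the real benchmark may weight differently).
    """
    if not stages:
        return 0.0
    total = 0.0
    for stage in stages:
        if not stage.passed:
            break  # this stage and everything downstream contribute nothing
        total += stage.score
    return total / len(stages)


# Hypothetical four-stage run mirroring the stage names in the table.
clean_run = [
    StageResult("format_parse", 1.0, True),
    StageResult("target_selection", 1.0, True),
    StageResult("switch_quality", 0.5, True),
    StageResult("multi_turn_stability", 0.0, True),
]
# Same run, but the output could not be parsed at all.
parse_failure = [StageResult("format_parse", 0.0, False)] + clean_run[1:]

print(cascade_score(clean_run))      # 0.625
print(cascade_score(parse_failure))  # 0.0
```

The point of the gate is that a downstream stage cannot earn credit on an output that was never valid in the first place, which is why format errors dominate scores in this style of benchmark.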

2.2 Turn-taking & when-to-speak

Resource | Year | What | Eval method
LLMs Know What To Say But Not When To Speak (TRP) | 2024 | Participant-labeled within-turn Transition Relevance Places in spoken dialogue | Binary TRP classification vs human labels; precision/recall/F1; LLMs near-random (F1 ~0.14-0.16)
Beyond Words / MM-When2Speak | 2025 | Multimodal LLM choosing respond vs backchannel vs stay-silent | Per-class P/R/F1 where silence is a first-class correct label (uses its own 357 curated dyadic videos, not Fisher/MAHNOB as the abstract framing implied)
Addressee Recognition in Multimodal Multi-party (Inoue et al.) | 2025 | Triadic corpus, ~20% turns have explicit addressee | Accuracy vs chance; GPT-4o only marginally above chance
Addressee & Response Selection (Ouchi & Tsuboi) | 2016 | Foundational ARS task on Ubuntu multiparty corpus | ADR accuracy + response recall@k + joint accuracy (later SOTA ASRG ~84.65%)
MUCA | 2024 | Group-chat agent whose core problem is when/whether to speak (in-context "chime-in" module) | Human studies (% reporting bot "chimes in excessively": 56.25% basic vs 0% advanced) + evenness/consensus metrics
TurnGPT | 2020 | GPT-2 predicting turn-shifts via TRP tokens, text-only | Predicts end-of-turn / TRP tokens; outperforms prior end-of-turn baselines
Triadic VAP | 2025 | First Voice Activity Projection extended to 3-party | Future joint voice-activity prediction; triadic-trained beats dyadic baselines; dyadic models degrade with more parties
Lla-VAP | 2024 | LSTM ensemble of Llama + VAP for turn-taking | F1 on labeled turn-shift points on CCPE. Caveat: the widely cited "83.13 F1" is the VAP baseline's recall, not the ensemble's; the ensemble's actual CCPE F1 is 0.964 (the survey mis-transcribed it).
Multi-Party Conversational Agents: A Survey | 2025 | Meta-resource mapping turn-detection + addressee-selection benchmarks/metrics | Survey; explicitly flags "silence-as-correct / response inhibition" as largely unexplored

Synthesis. When-to-speak is a separately gradable capability from what-to-say, and current LLMs are bad at it — the cleanest result in the whole survey. But almost all rigorous timing work is acoustic/spoken (TRP, VAP); for text group chat the "react vs reply vs ignore" decision is real but essentially ungraded. MUCA's "excessive chime-in %" is the closest thing to a reusable text-restraint metric, and it's a bespoke human study.
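
To make "silence as a first-class correct label" concrete, here is a minimal sketch of per-class precision/recall/F1 over a three-way respond / backchannel / silent decision. The label names and the toy data are assumptions for illustration; no benchmark's official scorer is reproduced here.

```python
from collections import Counter

LABELS = ("respond", "backchannel", "silent")  # assumed label set


def per_class_prf(gold, pred, labels=LABELS):
    """Return {label: {"precision", "recall", "f1"}} for single-label decisions."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p where it was wrong
            fn[g] += 1  # missed the true label g
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy example: an over-eager agent that replies at every decision point.
gold = ["silent", "silent", "respond", "silent", "backchannel", "respond"]
pred = ["respond"] * len(gold)

for label, scores in per_class_prf(gold, pred).items():
    print(label, {k: round(v, 3) for k, v in scores.items()})
```

On this toy data the always-reply agent gets perfect recall on "respond" and zero recall on "silent", which is the failure a respond-only metric would hide.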

2.3 Addressing & mention/speaker resolution

Resource | Year | What | Eval method
Ouchi & Tsuboi (ARS) | 2016 | Joint "whom to address + what to say" on Ubuntu IRC | Addressee accuracy + response recall@k + joint correctness
Who Is Speaking to Whom? (W2W) | 2019 | Identifies the addressee of every utterance jointly (full who-talks-to-whom graph) | Per-utterance addressee accuracy, broken down by participant count
irc-disentanglement (Kummerfeld et al.) | 2019 | 77,563 IRC messages with reply-to links; 16× larger than all prior disentanglement data; DSTC-8 Track 2 | Reply-link P/R/F1 on edges + clustering (VI, one-to-one, Shen-F, exact-match conversations)
Molweni | 2020 | MRC over multiparty dialogue + SDRT discourse graphs | EM/F1 (BERT-wwm 67.7% F1, ~20pt drop vs SQuAD 2.0) + discourse link/relation F1
Addressee Recognition (Inoue et al. / TEIDAN) | 2025 | Triadic Japanese corpus, 30 sessions; ~20% explicit-addressee turns | 4-way A/B/C/O addressee classification against an 80.1% majority baseline; GPT-4o reaches 80.9%, and is below chance on next-speaker prediction (note: annotated subset is ~29 min, not 29 h)
Multimodal Conversation Structure Understanding (TV-MMPC) | 2025 | Speaker + addressee + reply-to relations bundled into one LLM-facing benchmark | Per-relation accuracy / Set-F1 vs human annotations
WHO Says WHAT to WHOM (survey) | 2022 | IJCAI survey framing MPC as WHO / WHAT / WHOM | Survey; consolidates addressee-recognition + response-selection task formulations
Multi-Party Conversational Agents: A Survey | 2025 | Taxonomy + metric inventory for group-chat sub-capabilities | Survey; maps datasets (Ubuntu IRC, Molweni) to accuracy/F1/disentanglement metrics

Synthesis. This is the most mature corner — addressee accuracy and disentanglement clustering metrics (VI / one-to-one / Shen-F) are well-established and automatable. The catch: they grade classification on human-authored transcripts, not whether a generative agent routes its own reply to the right person. The modern LLM result (Inoue et al.: GPT-4o barely beats an 80% majority baseline, below chance on next-speaker) shows the capability is unsolved precisely where it now matters most — inside a generating agent.
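
The "barely beats the majority baseline" framing is easy to reproduce in a few lines. The sketch below computes addressee accuracy alongside the accuracy of always guessing the most frequent gold label. The toy labels are invented for illustration and only reuse the 4-way A/B/C/O label set; they are not data from any cited corpus.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of turns where the predicted addressee matches the gold one."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def majority_baseline(gold):
    """Return (most frequent gold label, accuracy of always predicting it)."""
    label, count = Counter(gold).most_common(1)[0]
    return label, count / len(gold)


# Invented toy data: most turns carry the catch-all label "O".
gold = ["O"] * 8 + ["A", "B"]
pred = ["O"] * 7 + ["A"] + ["A", "B"]

model_acc = accuracy(gold, pred)
base_label, base_acc = majority_baseline(gold)
print(f"model={model_acc:.3f} baseline({base_label})={base_acc:.3f} "
      f"lift={model_acc - base_acc:+.3f}")
```

Reporting the lift over the baseline, not raw accuracy, is the design choice that matters: with a skewed label distribution, an 80% score can mean the model learned nothing about addressing.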

2.4 Social appropriateness, role/persona & multi-speaker context

Resource | Year | What | Eval method
SOTOPIA / SOTOPIA-Eval | 2023 | Goal-driven social role-play between LLM agents with private goals | 7-dim rubric (Believability, Relationship, Knowledge, Secret, Social Rules, Financial, Goal); human + GPT-4 judge, validated against humans
DEBATE | 2025 | Whether multi-agent role-play reproduces real human group dynamics; ~29k messages, 697 groups, public + private beliefs | Utterance metrics (semantic sim, stance delta, ROUGE-L) + group opinion-dynamics (convergence, public/private dissociation) + individual partner-influence
MPCEval | 2026 | Purpose-built MPC generation benchmark; next-message + full-rollout | Reference-free novel metrics across speaker modeling, content quality, speaker-content consistency (the paper explicitly rejects ROUGE/BLEU/BERTScore/G-Eval, contrary to one finding's claim)
PersonaGym / PersonaScore | 2024 | Dynamic persona-agent eval; 200 personas, 10k questions, 150 environments | 5 axes (Expected Action, Action Justification, Linguistic Habits, Persona Consistency, Toxicity Control) scored 1-5 by an LLM-judge ensemble
RENOVI | 2024 | 9,258 dialogues annotated with social norms | Sequenced detect → classify → remediate norm violations + LLM-human norm-alignment
NormBank (SCENE) | 2023 | 155k role/setting-conditioned social norms | Non-monotonic classification: same behavior labeled expected/permitted/unexpected by role + setting
The Social Laboratory | 2025 | Psychometric framework for LLMs as social actors in multi-agent debate | Conformity (shift under group pressure), persuasion, role adherence across rounds

Synthesis. Social/persona quality is graded almost entirely by LLM-as-judge rubrics (SOTOPIA's 7 dimensions, PersonaScore's 5 axes), and SOTOPIA's GPT-4-judge-vs-human validation is the template the field copies. The frontier shift is from grading single dialogues to grading emergent group dynamics: DEBATE and Social Laboratory measure conformity, partner influence, and public/private belief drift, which dyadic evals structurally cannot see. NormBank's role-conditioning (a line that's fine from one speaker but not another) is the key idea for context-dependent appropriateness, but per-turn speaker-conditioned scoring in a live group is still underdeveloped.
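
Validating a judge against humans usually comes down to a per-dimension agreement statistic. A minimal sketch, assuming Pearson correlation as that statistic and invented scores; the dimension names are borrowed from the rubric above purely as dictionary keys, and nothing here reproduces a published validation protocol.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; reported as 0 by convention
    return cov / sqrt(var_x * var_y)


def judge_agreement(human, judge):
    """human/judge: {dimension: [score per episode]} -> {dimension: correlation}."""
    return {dim: pearson(human[dim], judge[dim]) for dim in human}


# Invented scores for five episodes on two rubric dimensions.
human = {"goal": [1, 2, 3, 4, 5], "secret": [5, 1, 4, 2, 3]}
judge = {"goal": [1, 2, 3, 5, 5], "secret": [3, 3, 4, 3, 2]}

for dim, r in judge_agreement(human, judge).items():
    print(dim, round(r, 3))
```

Reporting agreement per dimension, instead of one pooled number, is what lets a weak dimension show up at all.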

2.5 Simulators & user-simulation harnesses

Resource | Year | What | Eval method
ProMediate | 2025 | Proactive mediator agent in multi-party negotiation; Easy/Medium/Hard tiers; simulated participants in 3 conflict modes | Consensus Change, Topic Efficiency, Response Latency, Mediator Effectiveness (consensus-slope pre/post intervention), Mediator Intelligence (LLM-judge 1-5)
SOTOPIA | 2023 | Procedurally generated social-interaction env (dyadic → multi-party planning) | 7-dim rubric, human + GPT-4 judge; paper notes GPT-4 weaker on Social Rules / Secret dims
tau2-bench (τ²-bench) | 2025 | Dual-control tool-agent-user benchmark (both user and agent call tools) | UserSimulator LLM drives turns; reward gated on required tool actions / DB end-state + policy; pass^k reliability over repeated trials
RealUserSim | 2026 | Simulators grounded in 7,275 behavioral profiles from 14k+ real WildChat conversations | PT3 fidelity benchmark (style-match 24.2%→45.3% with grounding); agent-eval on TauBench surfaces failures cooperative sims miss; failure modes "Formalism Ceiling" / "Directive Amplification"
Mind the Sim2Real Gap | 2026 | Quantifies how faithfully LLM user-simulators replicate humans | User-Sim Index (USI, 0-100, six dims) via Sørensen-Dice + ECE + MAE; validated on 451 humans / 165 τ-bench tasks; sims create "easy mode," binary reward orthogonal to human-perceived quality
SAGE | 2025 | Top-down (persona) + bottom-up (knowledge) grounded user simulator | Measured by bug-finding power: surfaces up to 33% more agent errors than generic-user baselines
GroupMemBench | 2026 | Multi-party agent memory; graph-grounded synthesis + adversarial asker-bound queries | Accuracy on group dynamics / speaker-grounded belief / audience-adapted language; best 46.0%
MUCA + MUS | 2024 | Multi-user agent + simulator modeling real chat-record behavior | Engagement, evenness, consensus vs GPT-4 baseline across decision/problem/discussion tasks
MultiAgentBench (MARBLE) | 2025 | Multi-agent suite incl. Werewolf / bargaining | Milestone KPI = n_j/M per agent + LLM-judged Communication/Planning/Coordination (0-5); competition win/loss

Synthesis. The reusable machinery is simulated participants + state-diff grading: spin up LLM "humans," run a rollout, score the change they produce (consensus delta, leakage rate, milestone attribution) rather than a single turn. The honest warning that runs through this entire sub-topic — and is the single most important methodological takeaway — is that the simulated humans are too nice: RealUserSim and Sim2Real-Gap both show LLM simulators inflate agent success and that binary task reward is orthogonal to human-perceived quality. tau2-bench's pass^k is the standard reliability metric practitioners already reach for. True N-simulated-humans-plus-one-agent harnesses with grading remain rare (ProMediate, MUCA, GroupMemBench).
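
For readers who have not met pass^k: it asks how often an agent solves the same task on all k of k independent attempts, so it punishes flakiness that a single-run success rate hides. A minimal sketch, assuming the usual combinatorial estimator C(c, k) / C(n, k) averaged over tasks, where c is the number of successes out of n trials on a task; the function name and the toy counts are illustrative.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k sampled trials of a task all succeed.

    successes_per_task: number of successful trials (out of n_trials) per task.
    """
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= n_trials:
            raise ValueError("success count outside [0, n_trials]")
        # math.comb(c, k) is 0 when k > c, so tasks with too few wins score 0.
        total += comb(c, k) / comb(n_trials, k)
    return total / len(successes_per_task)


# Toy example: three tasks, four trials each, with 4, 3, and 0 successes.
counts = [4, 3, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n_trials=4, k=k), 4))
```

At k=1 this reduces to the plain average success rate; as k grows, only tasks the agent solves every time keep contributing, which is why the curve drops for unreliable agents.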

3. How group-chat eval actually works (the methods core)

Across all five sub-topics, the same handful of mechanisms recur. A "group-chat agent eval" is some composition of:

  1. Addressee accuracy (the signature metric). Pick the correct interlocutor from prior speakers; scored as classification accuracy, sometimes 4-way (A/B/C/none) against a majority-class baseline (Inoue et al.), sometimes joint with response selection (Ouchi & Tsuboi's ADR + recall@k). It appears verbatim as HSII's r2 stage. This is the cleanest, most automatable group-chat probe and the historical root of the field.

  2. Response-decision / when-to-speak F1. Binary "is this a valid moment to speak?" graded by precision/recall/F1 against human-labeled transition points (TRP work, TurnGPT, VAP family). The crucial variant — silence as a first-class correct label (respond vs backchannel vs stay-silent) — exists only in spoken/multimodal work (MM-When2Speak); MUCA's "excessive chime-in %" is the text proxy.

  3. Conversation disentanglement / structure metrics. Before an agent can act it must untangle interleaved threads: reply-link P/R/F1 on edges, then clustering scored by VI, one-to-one overlap, and Shen-F (irc-disentanglement), plus discourse-dependency F1 (Molweni). This is the prerequisite-skill layer.

  4. Simulated-user rollout + state-diff. Stand up LLM participants, run a multi-turn rollout, and grade the delta they produce: consensus change and mediator-effectiveness slope (ProMediate), engagement/evenness/consensus (MUCA), tool-action/DB end-state with pass^k reliability (tau2-bench), milestone attribution (MultiAgentBench). Always paired with a fidelity check — USI (Sim2Real-Gap) or PT3 (RealUserSim) — because over-cooperative sims inflate the grade.

  5. LLM-as-judge social rubrics. Multi-dimension 1-5 / Likert rubrics for the qualities that have no exact-match answer: SOTOPIA's 7 dimensions, PersonaScore's 5 axes, ProMediate's Mediator Intelligence. SOTOPIA's GPT-4-judge-validated-against-humans is the calibration template — with the documented caveat that judges are weaker on subtle dimensions (Social Rules, Secret), and judge reliability across many speakers in long transcripts is largely unvalidated.

  6. Group-specific safety / leakage. A uniquely multi-party axis: what does the agent reveal to other participants? Measured as contextual-integrity leakage rate (MAGPIE), and as speaker-grounded belief/audience-adaptation accuracy (GroupMemBench).
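
Of the mechanisms listed above, the disentanglement clustering metrics in item 3 are the least self-explanatory, so here is a minimal sketch of Variation of Information (VI) between a gold and a predicted assignment of messages to threads. It computes raw VI in nats, where 0 means identical clusterings and lower is better; published numbers may be rescaled so that higher is better, which this sketch does not attempt. The thread ids are invented.

```python
from collections import Counter
from math import log


def variation_of_information(gold, pred):
    """VI = H(gold | pred) + H(pred | gold) over per-message thread ids."""
    n = len(gold)
    if n == 0 or n != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    gold_sizes = Counter(gold)
    pred_sizes = Counter(pred)
    joint = Counter(zip(gold, pred))  # messages sharing both a gold and a pred thread
    vi = 0.0
    for (g, p), n_gp in joint.items():
        # -p(g,p) * [log p(g,p)/p(g) + log p(g,p)/p(p)], written with raw counts
        vi -= (n_gp / n) * (log(n_gp / gold_sizes[g]) + log(n_gp / pred_sizes[p]))
    return vi


# Four messages in two gold threads.
gold_threads = [0, 0, 1, 1]
same_up_to_renaming = [5, 5, 9, 9]   # same partition, different ids
everything_merged = [0, 0, 0, 0]     # predicted one big thread

print(variation_of_information(gold_threads, same_up_to_renaming))
print(variation_of_information(gold_threads, everything_merged))
```

Because VI compares partitions, not label names, the renamed clustering scores as a perfect match, while merging both threads costs exactly the information needed to split them again.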

The recurring shape: a cheap, automatable classification gate (addressee, when-to-speak, disentanglement) + an expensive judge/rollout layer (social rubric, state-diff) + a fidelity check on the simulated humans.
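
That shape can be sketched as a two-layer grader: a cheap exact-match gate on the speak/stay-silent decision and the addressee, followed by a rubric judge that only runs when the gate passes. Everything below is hypothetical scaffolding (the `Turn` fields, `grade_turn`, and the stub judge standing in for an LLM call), and the simulator-fidelity check is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    gold_should_speak: bool
    gold_addressee: Optional[str]   # None when silence is the correct move
    agent_spoke: bool
    agent_addressee: Optional[str]
    agent_text: str = ""


def grade_turn(turn: Turn, judge: Callable[[str], float]) -> dict:
    """Cheap gate first; call the (expensive) judge only if the gate passes."""
    gate_ok = turn.agent_spoke == turn.gold_should_speak
    if gate_ok and turn.agent_spoke:
        gate_ok = turn.agent_addressee == turn.gold_addressee
    rubric = None
    if gate_ok and turn.agent_spoke:
        rubric = judge(turn.agent_text)  # e.g. a 1-5 social-rubric score
    return {"gate_ok": gate_ok, "rubric": rubric}


def stub_judge(text: str) -> float:
    """Placeholder for an LLM-as-judge call; always returns a fixed score."""
    return 4.0


turns = [
    Turn(True, "alice", True, "alice", "Sure, I can take that."),  # right person
    Turn(True, "alice", True, "bob", "Sure, I can take that."),    # wrong person
    Turn(False, None, False, None),                                # correct silence
    Turn(False, None, True, "bob", "Actually..."),                 # should have stayed quiet
]
for t in turns:
    print(grade_turn(t, stub_judge))
```

Correct silence passes the gate with no rubric score at all, which keeps restraint from being averaged away by the judge layer.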

4. Open gaps (the genuine frontier)

5. Relevance to our demo

Honest take: yes, there's a demoable toy — a narrow, sharp one — but it's a half-chapter extension, not a new module. The strongest single asset is that the most distinctive group-chat skill (addressee resolution) is cleanly automatable and demonstrably broken in frontier models — which is exactly the kind of "watch it fail with your own input" insight the curriculum is built around.

Concrete demoable toy: an "Is this for me, and should I speak?" gate.

How it extends specific existing chapters:

What not to oversell: the full research frontier (live human channels, N>3 floor management, simulator fidelity, speaker-grounded memory) is heavy and unsettled, and not toy-able in under three minutes. The simulator-fidelity caution (Sim2Real-Gap: "your fake users are too nice") is a genuinely good concept slide but not a runnable demo. Recommend: a single chapter ("Who is this for, and should I answer?") built on addressee-accuracy-vs-baseline + a social-rubric judge, explicitly reusing ch02/ch05/ch03 machinery. That's real, honest, and finishes in three minutes; anything larger is a research project, not a mentorship demo.