How Group-Chat Agent Eval Is Done Today

A literature survey for deciding whether group-chat agent evaluation is worth teaching.

1. TL;DR

The field is real but young and fragmented. The classic NLP foundations (addressee + response selection, conversation disentanglement) are mature and well-formalized, but the agent-era group-chat eval — an LLM acting as one participant in a live multi-party channel — is mostly 2024-2026 work with no shared leaderboard.
No dominant single mechanism — but a recurring spine exists: classification-accuracy on addressee/turn decisions, plus LLM-as-judge rubrics for social quality, plus simulated-participant rollouts graded by state-diff (consensus change, leakage, milestones).
The most distinctive group-chat skills are demonstrably unsolved. Frontier LLMs (GPT-4o) barely beat a majority-class baseline on addressee recognition (Inoue et al. 2025), LLMs score near-random on when-to-speak timing (Umair et al. 2024), and the best multi-party memory systems score 46% (GroupMemBench).
"When to stay silent" is the biggest hole. Text group-chat benchmarks assume the agent always replies and only ask which reply; silence-as-correct-action is only studied in spoken/multimodal work.
Simulator fidelity is the dominant validity threat. LLM-simulated participants are over-cooperative ("easy mode") and inflate agent scores (RealUserSim, Sim2Real-Gap) — so any group-chat eval must validate its fake humans before trusting the grade.
Verdict for teaching: there is a small, honest, demoable toy here — addressee accuracy + a "should I even speak?" gate — that cleanly extends our existing judge (ch05), trajectory (ch06), and human-scoring (ch03) chapters. Not a full new module, but a strong half-chapter.

2. Landscape by sub-topic

2.1 Benchmarks & datasets (agent-in-group-chat)

Resource	Year	What	Eval method
HSII (How Social Is It?)	2025	LLM as autonomous social agent, ~6.7 participants / 7.8 turns; 8,305 samples from real news via GPT-4 + human refinement	Cascading 4-stage score ι: format parse → target (addressee) selection → switch quality → multi-turn stability; a parse failure zeroes everything downstream
DICE-Bench	2025	First multi-round, multi-party tool-calling benchmark; 1,607 instances, 2-4 participants, 124-tool graph	Exact Match on the function call + DICE-Score measuring how dispersed tool params are across turns/speakers (higher = harder)
GroupMemBench	2026	Agent memory in multi-party chat (vs dyadic assumption of prior memory benchmarks)	Accuracy on group dynamics, speaker-grounded belief tracking, audience-adapted (ToM) language; best system 46.0%
MUCA + MUS	2024	First LLM framework for multi-user group chat (3W: What/When/Who) + paired multi-user simulator	Engagement, conversation evenness (participation balance), opinion consensus + subjective ratings, via simulated group convos
MAGPIE	2025	Contextual-privacy eval in multi-agent collaboration; ~200 high-stakes tasks where private info is essential	% sensitive info leaked to other agents while still completing the task (Gemini 2.5-Pro up to 50.7%, GPT-5 up to 35.1%)
MultiAgentBench	2025	LLM multi-agent systems, cooperative + competitive, star/tree/graph/chain topologies	Milestone-based KPIs scoring collaboration/competition quality (not just completion) across coordination protocols
Addressee & Response Selection (Hu et al.)	2018	Ubuntu IRC; sender/addressee/observer roles	Addressee selection (ADR) accuracy + response selection (RES) accuracy at varying participant counts
Molweni	2020	Multi-party MRC from Ubuntu, ~10k dialogues, 30,066 QA pairs, reply-to discourse graphs	Span MRC F1/EM (with unanswerable Qs) + discourse-dependency parsing (who-replies-to-whom)

Synthesis. Real benchmarks exist, but they split into two non-overlapping camps: agent-to-agent settings (MultiAgentBench, MAGPIE) and static-dialogue comprehension (Molweni, Ubuntu IRC). The genuinely new agent-era work (HSII, DICE-Bench, GroupMemBench) converges on the same hard core — track who said what across many speakers before acting — and reports low scores, confirming the difficulty is real. What's missing everywhere is an LLM embedded as one participant in a live human channel with interruptions and side-threads.
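The cascading idea in HSII's staged score (a parse failure zeroes everything downstream) can be illustrated with a small sketch. This is not HSII's actual formula for ι: the field names, the 0-1 stage scores, the equal stage weights, and the multiplicative gating are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageScores:
    """Per-sample stage results; field names are illustrative, not HSII's."""
    parsed: bool           # stage 1: did the output parse at all?
    target_correct: bool   # stage 2: was the right addressee selected?
    switch_quality: float  # stage 3: reply quality for that target, in [0, 1]
    stability: float       # stage 4: consistency over later turns, in [0, 1]


def cascade_score(s: StageScores) -> float:
    """Hypothetical cascade: each stage's credit is gated by earlier stages."""
    if not s.parsed:
        return 0.0  # a parse failure zeroes everything downstream
    credits = [1.0]  # stage 1 passed
    gate = 1.0 if s.target_correct else 0.0
    credits.append(gate)       # stage 2
    gate *= s.switch_quality
    credits.append(gate)       # stage 3, gated by stage 2
    gate *= s.stability
    credits.append(gate)       # stage 4, gated by stages 2 and 3
    return sum(credits) / len(credits)  # equal weights (assumption)


# A well-formed reply sent to the wrong person earns only the parsing credit.
wrong_target = cascade_score(StageScores(True, False, 0.9, 0.9))
```

The design point is that a fluent reply aimed at the wrong participant cannot buy back credit in later stages, which is what separates a cascade from a plain average of stage scores.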

2.2 Turn-taking & when-to-speak

Resource	Year	What	Eval method
LLMs Know What To Say But Not When To Speak (TRP)	2024	Participant-labeled within-turn Transition Relevance Places in spoken dialogue	Binary TRP classification vs human labels; precision/recall/F1 — LLMs near-random (F1 ~0.14-0.16)
Beyond Words / MM-When2Speak	2025	Multimodal LLM choosing respond vs backchannel vs stay-silent	Per-class P/R/F1 where silence is a first-class correct label (evaluated on its own 357 curated dyadic videos, not Fisher/MAHNOB)
Addressee Recognition in Multimodal Multi-party (Inoue et al.)	2025	Triadic corpus, ~20% turns have explicit addressee	Accuracy vs majority-class baseline; GPT-4o only marginally above it
Addressee & Response Selection (Ouchi & Tsuboi)	2016	Foundational ARS task on Ubuntu multiparty corpus	ADR accuracy + response recall@k + joint accuracy (later SOTA ASRG ~84.65%)
MUCA	2024	Group-chat agent whose core problem is when/whether to speak (in-context "chime-in" module)	Human studies (% reporting bot "chimes in excessively": 56.25% basic vs 0% advanced) + evenness/consensus metrics
TurnGPT	2020	GPT-2 predicting turn-shifts via TRP tokens, text-only	Predicts end-of-turn / TRP tokens; outperforms prior end-of-turn baselines
Triadic VAP	2025	First Voice Activity Projection extended to 3-party	Future joint voice-activity prediction; triadic-trained beats dyadic baselines — dyadic models degrade with more parties
Lla-VAP	2024	LSTM ensemble of Llama + VAP for turn-taking	F1 on labeled turn-shift points on CCPE. Caveat: the widely cited "83.13 F1" is the VAP baseline's recall, not the ensemble's score; the ensemble's actual CCPE F1 is 0.964.
Multi-Party Conversational Agents: A Survey	2025	Meta-resource mapping turn-detection + addressee-selection benchmarks/metrics	Survey; explicitly flags "silence-as-correct / response inhibition" as largely unexplored

Synthesis. When-to-speak is a separately gradable capability from what-to-say, and current LLMs are bad at it — the cleanest result in the whole survey. But almost all rigorous timing work is acoustic/spoken (TRP, VAP); for text group chat the "react vs reply vs ignore" decision is real but essentially ungraded. MUCA's "excessive chime-in %" is the closest thing to a reusable text-restraint metric, and it's a bespoke human study.
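Treating silence as a first-class label changes what a grader can see. The sketch below computes per-class precision/recall/F1 over a three-way respond / backchannel / silent decision; the label names and the toy gold and predicted lists are assumptions for illustration, not data from any benchmark above.

```python
from collections import Counter

LABELS = ("respond", "backchannel", "silent")  # illustrative label set


def per_class_prf(gold, pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for a multi-class decision."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p when it was not p
            fn[g] += 1  # missed a true g
    scores = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores


# An "always reply" agent on turns where silence was mostly the right call.
gold = ["silent", "silent", "respond", "silent"]
pred = ["respond", "respond", "respond", "respond"]
scores = per_class_prf(gold, pred)
```

In this toy case the always-reply agent gets perfect recall on "respond" but zero on "silent", a failure that a benchmark which only asks "which reply?" never records.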

2.3 Addressing & mention/speaker resolution

Resource	Year	What	Eval method
Ouchi & Tsuboi (ARS)	2016	Joint "whom to address + what to say" on Ubuntu IRC	Addressee accuracy + response recall@k + joint correctness
Who Is Speaking to Whom? (W2W)	2019	Identifies the addressee of every utterance jointly (full who-talks-to-whom graph)	Per-utterance addressee accuracy, broken down by participant count
irc-disentanglement (Kummerfeld et al.)	2019	77,563 IRC messages with reply-to links; 16× larger than all prior disentanglement data; DSTC-8 Track 2	Reply-link P/R/F1 on edges + clustering (VI, one-to-one, Shen-F, exact-match conversations)
Molweni	2020	MRC over multiparty dialogue + SDRT discourse graphs	EM/F1 (BERT-wwm 67.7% F1, ~20pt drop vs SQuAD 2.0) + discourse link/relation F1
Addressee Recognition (Inoue et al. / TEIDAN)	2025	Triadic Japanese corpus, 30 sessions; ~20% explicit-addressee turns	4-way A/B/C/O addressee classification; GPT-4o 80.9% vs the 80.1% majority baseline; below chance on next-speaker (note: the annotated subset is ~29 min, not 29 h)
Multimodal Conversation Structure Understanding (TV-MMPC)	2025	Speaker + addressee + reply-to relations bundled into one LLM-facing benchmark	Per-relation accuracy / Set-F1 vs human annotations
WHO Says WHAT to WHOM (survey)	2022	IJCAI survey framing MPC as WHO / WHAT / WHOM	Survey; consolidates addressee-recognition + response-selection task formulations
Multi-Party Conversational Agents: A Survey	2025	Taxonomy + metric inventory for group-chat sub-capabilities	Survey; maps datasets (Ubuntu IRC, Molweni) to accuracy/F1/disentanglement metrics

Synthesis. This is the most mature corner — addressee accuracy and disentanglement clustering metrics (VI / one-to-one / Shen-F) are well-established and automatable. The catch: they grade classification on human-authored transcripts, not whether a generative agent routes its own reply to the right person. The modern LLM result (Inoue et al.: GPT-4o barely beats an 80% majority baseline, below chance on next-speaker) shows the capability is unsolved precisely where it now matters most — inside a generating agent.
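The edge-level part of disentanglement scoring is simple enough to sketch. Assuming each reply link is represented as a (message_id, parent_id) pair, which is a representation chosen here for illustration, reply-link precision/recall/F1 reduces to set overlap between gold and predicted edges.

```python
def reply_link_prf(gold_links, pred_links):
    """Edge-level precision/recall/F1 over (message_id, parent_id) pairs."""
    gold, pred = set(gold_links), set(pred_links)
    hits = len(gold & pred)  # links predicted with the correct parent
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Message 3 is attached to the wrong parent; the other two links are right.
gold = [(2, 1), (3, 1), (4, 2)]
pred = [(2, 1), (3, 2), (4, 2)]
p, r, f1 = reply_link_prf(gold, pred)
```

The clustering metrics (VI, one-to-one, Shen-F) are scored on the threads these links induce and are not covered by this sketch.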

2.4 Social appropriateness, role/persona & multi-speaker context

Resource	Year	What	Eval method
SOTOPIA / SOTOPIA-Eval	2023	Goal-driven social role-play between LLM agents with private goals	7-dim rubric (Believability, Relationship, Knowledge, Secret, Social Rules, Financial, Goal); human + GPT-4 judge, validated against humans
DEBATE	2025	Whether multi-agent role-play reproduces real human group dynamics; ~29k messages, 697 groups, public + private beliefs	Utterance metrics (semantic sim, stance delta, ROUGE-L) + group opinion-dynamics (convergence, public/private dissociation) + individual partner-influence
MPCEval	2026	Purpose-built MPC generation benchmark; next-message + full-rollout	Reference-free novel metrics across speaker modeling, content quality, speaker-content consistency (the paper explicitly rejects ROUGE/BLEU/BERTScore/G-Eval)
PersonaGym / PersonaScore	2024	Dynamic persona-agent eval; 200 personas, 10k questions, 150 environments	5 axes (Expected Action, Action Justification, Linguistic Habits, Persona Consistency, Toxicity Control) scored 1-5 by an LLM-judge ensemble
RENOVI	2024	9,258 dialogues annotated with social norms	Sequenced detect → classify → remediate norm violations + LLM-human norm-alignment
NormBank (SCENE)	2023	155k role/setting-conditioned social norms	Non-monotonic classification: same behavior labeled expected/permitted/unexpected by role + setting
The Social Laboratory	2025	Psychometric framework for LLMs as social actors in multi-agent debate	Conformity (shift under group pressure), persuasion, role adherence across rounds

Synthesis. Social/persona quality is graded almost entirely by LLM-as-judge rubrics (SOTOPIA's 7 dimensions, PersonaScore's 5 axes), and SOTOPIA's GPT-4-judge-vs-human validation is the template the field copies. The frontier shift is from grading single dialogues to grading emergent group dynamics — DEBATE and Social Laboratory measure conformity, partner influence, and public/private belief drift, which dyadic evals structurally cannot see. NormBank's role-conditioning (a line that's fine from one speaker but not another) is the key idea for context-dependent appropriateness, but per-turn speaker-conditioned scoring in a live group is still underdeveloped.
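Validating a judge against humans comes down to a per-dimension agreement check. The sketch below uses Pearson correlation as the agreement statistic, which is an assumption about what a reasonable check looks like rather than a description of any one paper's protocol; the dimension names and scores are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined; reported as 0 by convention here
    return cov / sqrt(vx * vy)


def judge_agreement(human, judge):
    """human/judge: {dimension: [score per episode]} -> {dimension: r}."""
    return {dim: pearson(human[dim], judge[dim]) for dim in human}


# Hypothetical scores for four episodes on two rubric dimensions.
human = {"goal": [1, 2, 3, 4], "secret": [1, 2, 3, 4]}
judge = {"goal": [2, 4, 6, 8], "secret": [3, 3, 3, 3]}
agreement = judge_agreement(human, judge)
```

Reporting agreement per dimension rather than pooled matters because a judge can track humans well on goal completion while being uninformative on subtler dimensions.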

2.5 Simulators & user-simulation harnesses

Resource	Year	What	Eval method
ProMediate	2025	Proactive mediator agent in multi-party negotiation; Easy/Medium/Hard tiers; simulated participants in 3 conflict modes	Consensus Change, Topic Efficiency, Response Latency, Mediator Effectiveness (consensus-slope pre/post intervention), Mediator Intelligence (LLM-judge 1-5)
SOTOPIA	2023	Procedurally generated social-interaction env (dyadic → multi-party planning)	7-dim rubric, human + GPT-4 judge; paper notes GPT-4 weaker on Social Rules / Secret dims
tau2-bench (τ²-bench)	2025	Dual-control tool-agent-user benchmark (both user and agent call tools)	UserSimulator LLM drives turns; reward gated on required tool actions / DB end-state + policy; pass^k reliability over repeated trials
RealUserSim	2026	Simulators grounded in 7,275 behavioral profiles from 14k+ real WildChat conversations	PT3 fidelity benchmark (style-match 24.2%→45.3% with grounding); agent-eval on TauBench surfaces failures cooperative sims miss; failure modes "Formalism Ceiling" / "Directive Amplification"
Mind the Sim2Real Gap	2026	Quantifies how faithfully LLM user-simulators replicate humans	User-Sim Index (USI, 0-100, six dims) via Sørensen-Dice + ECE + MAE; validated on 451 humans / 165 τ-bench tasks; sims create "easy mode," binary reward orthogonal to human-perceived quality
SAGE	2025	Top-down (persona) + bottom-up (knowledge) grounded user simulator	Measured by bug-finding power: surfaces up to 33% more agent errors than generic-user baselines
GroupMemBench	2026	Multi-party agent memory; graph-grounded synthesis + adversarial asker-bound queries	Accuracy on group dynamics / speaker-grounded belief / audience-adapted language; best 46.0%
MUCA + MUS	2024	Multi-user agent + simulator modeling real chat-record behavior	Engagement, evenness, consensus vs GPT-4 baseline across decision/problem/discussion tasks
MultiAgentBench (MARBLE)	2025	Multi-agent suite incl. Werewolf / bargaining	Milestone KPI = n_j/M per agent + LLM-judged Communication/Planning/Coordination (0-5); competition win/loss

Synthesis. The reusable machinery is simulated participants + state-diff grading: spin up LLM "humans," run a rollout, score the change they produce (consensus delta, leakage rate, milestone attribution) rather than a single turn. The honest warning that runs through this entire sub-topic — and is the single most important methodological takeaway — is that the simulated humans are too nice: RealUserSim and Sim2Real-Gap both show LLM simulators inflate agent success and that binary task reward is orthogonal to human-perceived quality. tau2-bench's pass^k is the standard reliability metric practitioners already reach for. True N-simulated-humans-plus-one-agent harnesses with grading remain rare (ProMediate, MUCA, GroupMemBench).
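The pass^k reliability idea can be sketched in a few lines. The version below uses the combinatorial estimator commonly associated with the τ-bench line of work: for a task with c successes out of n trials, the chance that k sampled trials all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and input layout are assumptions for illustration.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k trials of a task ALL succeed.

    successes_per_task: successful trial count (out of n_trials) per task.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    # math.comb(c, k) is 0 when c < k, so tasks with too few successes score 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Three tasks, four trials each: always passes, passes half the time, never.
counts = [4, 2, 0]
pass_1 = pass_hat_k(counts, n_trials=4, k=1)  # 0.5
pass_4 = pass_hat_k(counts, n_trials=4, k=4)
```

As k grows, only the task that succeeds on every trial keeps contributing, which is why pass^k penalizes flaky agents that a single-trial success rate would flatter.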

3. How group-chat eval actually works (the methods core)

Across all five sub-topics, the same handful of mechanisms recur. A "group-chat agent eval" is some composition of:

Addressee accuracy (the signature metric). Pick the correct interlocutor from prior speakers; scored as classification accuracy, sometimes 4-way (A/B/C/none) against a majority-class baseline (Inoue et al.), sometimes joint with response selection (Ouchi & Tsuboi's ADR + recall@k). It reappears as HSII's second stage (target selection). This is the cleanest, most automatable group-chat probe and the historical root of the field.
Response-decision / when-to-speak F1. Binary "is this a valid moment to speak?" graded by precision/recall/F1 against human-labeled transition points (TRP work, TurnGPT, VAP family). The crucial variant — silence as a first-class correct label (respond vs backchannel vs stay-silent) — exists only in spoken/multimodal work (MM-When2Speak); MUCA's "excessive chime-in %" is the text proxy.
Conversation disentanglement / structure metrics. Before an agent can act it must untangle interleaved threads: reply-link P/R/F1 on edges, then clustering scored by VI, one-to-one overlap, and Shen-F (irc-disentanglement), plus discourse-dependency F1 (Molweni). This is the prerequisite-skill layer.
Simulated-user rollout + state-diff. Stand up LLM participants, run a multi-turn rollout, and grade the delta they produce: consensus change and mediator-effectiveness slope (ProMediate), engagement/evenness/consensus (MUCA), tool-action/DB end-state with pass^k reliability (tau2-bench), milestone attribution (MultiAgentBench). Always paired with a fidelity check — USI (Sim2Real-Gap) or PT3 (RealUserSim) — because over-cooperative sims inflate the grade.
LLM-as-judge social rubrics. Multi-dimension 1-5 / Likert rubrics for the qualities that have no exact-match answer: SOTOPIA's 7 dimensions, PersonaScore's 5 axes, ProMediate's Mediator Intelligence. SOTOPIA's GPT-4-judge-validated-against-humans is the calibration template — with the documented caveat that judges are weaker on subtle dimensions (Social Rules, Secret), and judge reliability across many speakers in long transcripts is largely unvalidated.
Group-specific safety / leakage. A uniquely multi-party axis: what does the agent reveal to other participants? Measured as contextual-integrity leakage rate (MAGPIE), and as speaker-grounded belief/audience-adaptation accuracy (GroupMemBench).

The recurring shape: a cheap, automatable classification gate (addressee, when-to-speak, disentanglement) + an expensive judge/rollout layer (social rubric, state-diff) + a fidelity check on the simulated humans.
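State-diff grading scores what changed over a rollout rather than any single turn. The sketch below shows two such deltas under loud assumptions: consensus is defined here as one minus the mean pairwise gap between participant stances in [0, 1], and leakage is approximated by case-insensitive substring matching. Neither is ProMediate's or MAGPIE's actual definition; both are stand-ins to show the shape of the computation.

```python
from itertools import combinations


def consensus(stances):
    """1 minus the mean pairwise absolute gap between stances in [0, 1]."""
    pairs = list(combinations(stances.values(), 2))
    if not pairs:
        return 1.0  # a single participant trivially agrees with itself
    return 1.0 - sum(abs(a - b) for a, b in pairs) / len(pairs)


def consensus_change(before, after):
    """State diff: how much agreement moved across the rollout."""
    return consensus(after) - consensus(before)


def leakage_rate(private_items, agent_messages):
    """Fraction of private items that appear in the agent's messages.

    Substring matching is a crude stand-in for a real leakage detector.
    """
    if not private_items:
        return 0.0
    text = " ".join(agent_messages).lower()
    leaked = [item for item in private_items if item.lower() in text]
    return len(leaked) / len(private_items)


# Hypothetical stances before and after a mediated discussion.
before = {"ana": 0.0, "bo": 1.0, "cy": 0.5}
after = {"ana": 0.5, "bo": 0.5, "cy": 0.5}
delta = consensus_change(before, after)
```

Both numbers are only as trustworthy as the simulated participants that produced the rollout, which is why the fidelity check belongs next to them in any report.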

4. Open gaps (the genuine frontier)

No live, human-in-the-loop, embedded-agent benchmark. Everything is agent-vs-agent or static-transcript classification. An LLM as one participant in a sustained real channel with interruptions and side-threads is unmeasured.
Silence-as-correct in text is unbenchmarked. Text MPC benchmarks assume the agent always replies; there is no adopted metric for over-eager interjection / false-positive responding in text group chat.
Most rigorous work is dyadic or triadic. Genuine N>3 dynamics (overlapping threads, shifting floor) are barely covered, and triadic VAP shows dyadic models degrade as parties grow, so existing metrics may not transfer.
No unified scorecard. Addressee accuracy, evenness, leakage rate, disentanglement VI, and milestone KPIs are siloed per paper. There is no joint metric scoring whether + when + to whom + what in one pass, and no "group-chat competence" leaderboard.
Speaker-grounded memory / belief tracking is barely off the ground. GroupMemBench (2026) is the first standardized probe; best systems sit at 46%.
Simulator fidelity is unsolved and under-reported. Sims are "easy mode"; almost no group-chat eval reports a USI/PT3-style fidelity score alongside agent grades, and persona/decoding configs that move scores are rarely pinned or released.
Judge validity at multi-party scale is unvalidated. LLM-judge bias is documented dyadically (SOTOPIA); no calibrated, speaker-attributing judge or inter-rater protocol exists for long multi-party transcripts.
Multi-party safety beyond privacy is nearly absent — colluding participants, cross-user prompt injection, moderation of harmful cross-user dynamics.
Real-platform grounding is thin. Most agent-era data is synthetic/GPT-generated; few use real Slack/Discord/Teams traces, so production ecological validity is unverified.

5. Relevance to our demo

Honest take: yes, there's a demoable toy — a narrow, sharp one — but it's a half-chapter extension, not a new module. The strongest single asset is that the most distinctive group-chat skill (addressee resolution) is cleanly automatable and demonstrably broken in frontier models — which is exactly the kind of "watch it fail with your own input" insight the curriculum is built around.

Concrete demoable toy: an "Is this for me, and should I speak?" gate.

Hand the agent a short 3-4-person transcript (synthetic, hardcoded in data/ like our other toy sets) and ask it to (a) pick the addressee of the last message and (b) decide respond / react / stay-silent.
Grade addressee against a gold label as accuracy vs a majority-class baseline (the Inoue et al. framing) — this viscerally shows "the model barely beats always-guessing." That's a great teaching beat.
This is a near-exact transplant of our existing ch02 (code grader) exact-match/classification pattern, just over a multi-party transcript instead of a math answer.
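A minimal sketch of the grader for this gate, assuming each example and each prediction is a dict with hypothetical "addressee" and "action" keys. The labels and toy data are invented; the point is the comparison against an always-guess-the-majority baseline.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def majority_baseline(gold):
    """Accuracy of always guessing the most common gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


def grade_gate(examples, predictions):
    """Grade addressee choice and the respond / react / silent decision."""
    gold_addr = [e["addressee"] for e in examples]
    pred_addr = [p["addressee"] for p in predictions]
    gold_act = [e["action"] for e in examples]
    pred_act = [p["action"] for p in predictions]
    acc = accuracy(gold_addr, pred_addr)
    base = majority_baseline(gold_addr)
    return {
        "addressee_accuracy": acc,
        "majority_baseline": base,
        "lift_over_baseline": acc - base,
        "action_accuracy": accuracy(gold_act, pred_act),
    }


# Toy set: most messages are addressed to nobody in particular.
examples = [
    {"addressee": "none", "action": "silent"},
    {"addressee": "none", "action": "silent"},
    {"addressee": "none", "action": "react"},
    {"addressee": "agent", "action": "respond"},
    {"addressee": "bob", "action": "silent"},
]
predictions = [
    {"addressee": "none", "action": "respond"},
    {"addressee": "none", "action": "silent"},
    {"addressee": "none", "action": "react"},
    {"addressee": "none", "action": "respond"},
    {"addressee": "bob", "action": "silent"},
]
report = grade_gate(examples, predictions)
```

Showing the lift over baseline next to raw accuracy is the teaching beat: a number that looks strong in isolation can be only a small step above always guessing the majority label.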

How it extends specific existing chapters:

ch06 (trajectory grader) — the natural host. A trajectory through a group chat adds a per-step addressee and speak/stay-silent decision to grade alongside tool calls. DICE-Bench's "params scattered across speakers" and ProMediate's intervention-latency are the conceptual upgrades; we'd toy-ify them, not import them.
ch05 (LLM as judge) — directly reusable. SOTOPIA's 7-dimension rubric and the GPT-4-judge-validated-against-human pattern are exactly ch05's mechanism; a "social appropriateness" rubric (was the chime-in warranted? was it addressed to the right person?) drops into our existing judge with a different rubric, reusing ch04's rubric-design widget.
ch03 (human as judge) — the calibration story carries over: MUCA's "56% of users said it chimed in too much" is the human-scored restraint signal, and our ch07 (graders disagree) calibration view already compares human vs judge.

What not to oversell: the full research frontier (live human channels, N>3 floor management, simulator fidelity, speaker-grounded memory) is heavy and unsettled, and not toy-able in under three minutes. The simulator-fidelity caution (Sim2Real-Gap: "your fake users are too nice") is a genuinely good concept slide but not a runnable demo. Recommendation: a single half-chapter ("Who is this for, and should I answer?") built on addressee-accuracy-vs-baseline plus a social-rubric judge, explicitly reusing ch02/ch05/ch03 machinery. That is real, honest, and finishes in three minutes; anything larger is a research project, not a mentorship demo.