A literature survey for deciding whether group-chat agent evaluation is worth teaching.
1. TL;DR
- The field is real but young and fragmented. The classic NLP foundations (addressee + response selection, conversation disentanglement) are mature and well-formalized, but the agent-era group-chat eval — an LLM acting as one participant in a live multi-party channel — is mostly 2024-2026 work with no shared leaderboard.
- No dominant single mechanism — but a recurring spine exists: classification-accuracy on addressee/turn decisions, plus LLM-as-judge rubrics for social quality, plus simulated-participant rollouts graded by state-diff (consensus change, leakage, milestones).
- The most distinctive group-chat skills are demonstrably unsolved. Frontier LLMs (GPT-4o) barely beat a majority-class baseline on addressee recognition (Inoue et al. 2025) and score near-random on when-to-speak timing (Umair et al. 2024); the best multi-party memory system scores 46.0% (GroupMemBench).
- "When to stay silent" is the biggest hole. Text group-chat benchmarks assume the agent always replies and only ask which reply; silence-as-correct-action is only studied in spoken/multimodal work.
- Simulator fidelity is the dominant validity threat. LLM-simulated participants are over-cooperative ("easy mode") and inflate agent scores (RealUserSim, Sim2Real-Gap) — so any group-chat eval must validate its fake humans before trusting the grade.
- Verdict for teaching: there is a small, honest, demoable toy here — addressee accuracy + a "should I even speak?" gate — that cleanly extends our existing judge (ch05), trajectory (ch06), and human-scoring (ch03) chapters. Not a full new module, but a strong half-chapter.
2. Landscape by sub-topic
2.1 Benchmarks & datasets (agent-in-group-chat)
| Resource | Year | What | Eval method |
|---|---|---|---|
| HSII (How Social Is It?) | 2025 | LLM as autonomous social agent, ~6.7 participants / 7.8 turns; 8,305 samples from real news via GPT-4 + human refinement | Cascading 4-stage score ι: format parse → target (addressee) selection → switch quality → multi-turn stability; a parse failure zeroes everything downstream |
| DICE-Bench | 2025 | First multi-round, multi-party tool-calling benchmark; 1,607 instances, 2-4 participants, 124-tool graph | Exact Match on the function call + DICE-Score measuring how dispersed tool params are across turns/speakers (higher = harder) |
| GroupMemBench | 2026 | Agent memory in multi-party chat (vs dyadic assumption of prior memory benchmarks) | Accuracy on group dynamics, speaker-grounded belief tracking, audience-adapted (ToM) language; best system 46.0% |
| MUCA + MUS | 2024 | First LLM framework for multi-user group chat (3W: What/When/Who) + paired multi-user simulator | Engagement, conversation evenness (participation balance), opinion consensus + subjective ratings, via simulated group convos |
| MAGPIE | 2025 | Contextual-privacy eval in multi-agent collaboration; ~200 high-stakes tasks where private info is essential | % sensitive info leaked to other agents while still completing the task (Gemini 2.5-Pro up to 50.7%, GPT-5 up to 35.1%) |
| MultiAgentBench | 2025 | LLM multi-agent systems, cooperative + competitive, star/tree/graph/chain topologies | Milestone-based KPIs scoring collaboration/competition quality (not just completion) across coordination protocols |
| Addressee & Response Selection (Hu et al.) | 2018 | Ubuntu IRC; sender/addressee/observer roles | Addressee selection (ADR) accuracy + response selection (RES) accuracy at varying participant counts |
| Molweni | 2020 | Multi-party MRC from Ubuntu, ~10k dialogues, 30,066 QA pairs, reply-to discourse graphs | Span MRC F1/EM (with unanswerable Qs) + discourse-dependency parsing (who-replies-to-whom) |
Synthesis. Real benchmarks exist, but they split into two non-overlapping camps: agent-to-agent settings (MultiAgentBench, MAGPIE) and static-dialogue comprehension (Molweni, Ubuntu IRC). The genuinely new agent-era work (HSII, DICE-Bench, GroupMemBench) converges on the same hard core — track who said what across many speakers before acting — and reports low scores, confirming the difficulty is real. What's missing everywhere is an LLM embedded as one participant in a live human channel with interruptions and side-threads.
2.2 Turn-taking & when-to-speak
| Resource | Year | What | Eval method |
|---|---|---|---|
| LLMs Know What To Say But Not When To Speak (TRP) | 2024 | Participant-labeled within-turn Transition Relevance Places in spoken dialogue | Binary TRP classification vs human labels; precision/recall/F1 — LLMs near-random (F1 ~0.14-0.16) |
| Beyond Words / MM-When2Speak | 2025 | Multimodal LLM choosing respond vs backchannel vs stay-silent | Per-class P/R/F1 where silence is a first-class correct label (uses its own 357 curated dyadic videos — not Fisher/MAHNOB as the abstract framing implied) |
| Addressee Recognition in Multimodal Multi-party (Inoue et al.) | 2025 | Triadic corpus, ~20% turns have explicit addressee | Accuracy vs chance; GPT-4o only marginally above chance |
| Addressee & Response Selection (Ouchi & Tsuboi) | 2016 | Foundational ARS task on Ubuntu multiparty corpus | ADR accuracy + response recall@k + joint accuracy (later SOTA ASRG ~84.65%) |
| MUCA | 2024 | Group-chat agent whose core problem is when/whether to speak (in-context "chime-in" module) | Human studies (% reporting bot "chimes in excessively": 56.25% basic vs 0% advanced) + evenness/consensus metrics |
| TurnGPT | 2020 | GPT-2 predicting turn-shifts via TRP tokens, text-only | Predicts end-of-turn / TRP tokens; outperforms prior end-of-turn baselines |
| Triadic VAP | 2025 | First Voice Activity Projection extended to 3-party | Future joint voice-activity prediction; triadic-trained beats dyadic baselines — dyadic models degrade with more parties |
| Lla-VAP | 2024 | LSTM ensemble of Llama + VAP for turn-taking | F1 on labeled turn-shift points on CCPE. Caveat: the widely-cited "83.13 F1" is the VAP baseline's recall, not the ensemble's; the ensemble's actual CCPE F1 is 0.964 — the survey mis-transcribed it. |
| Multi-Party Conversational Agents: A Survey | 2025 | Meta-resource mapping turn-detection + addressee-selection benchmarks/metrics | Survey; explicitly flags "silence-as-correct / response inhibition" as largely unexplored |
Synthesis. When-to-speak is a separately gradable capability from what-to-say, and current LLMs are bad at it — the cleanest result in the whole survey. But almost all rigorous timing work is acoustic/spoken (TRP, VAP); for text group chat the "react vs reply vs ignore" decision is real but essentially ungraded. MUCA's "excessive chime-in %" is the closest thing to a reusable text-restraint metric, and it's a bespoke human study.
2.3 Addressing & mention/speaker resolution
| Resource | Year | What | Eval method |
|---|---|---|---|
| Ouchi & Tsuboi (ARS) | 2016 | Joint "whom to address + what to say" on Ubuntu IRC | Addressee accuracy + response recall@k + joint correctness |
| Who Is Speaking to Whom? (W2W) | 2019 | Identifies the addressee of every utterance jointly (full who-talks-to-whom graph) | Per-utterance addressee accuracy, broken down by participant count |
| irc-disentanglement (Kummerfeld et al.) | 2019 | 77,563 IRC messages with reply-to links; 16× larger than all prior disentanglement data; DSTC-8 Track 2 | Reply-link P/R/F1 on edges + clustering (VI, one-to-one, Shen-F, exact-match conversations) |
| Molweni | 2020 | MRC over multiparty dialogue + SDRT discourse graphs | EM/F1 (BERT-wwm 67.7% F1, ~20pt drop vs SQuAD 2.0) + discourse link/relation F1 |
| Addressee Recognition (Inoue et al. / TEIDAN) | 2025 | Triadic Japanese corpus, 30 sessions; ~20% explicit-addressee turns | 4-way A/B/C/O addressee classification; GPT-4o 80.9% vs an 80.1% majority baseline; below chance on next-speaker (note: annotated subset is ~29 min, not 29 h) |
| Multimodal Conversation Structure Understanding (TV-MMPC) | 2025 | Speaker + addressee + reply-to relations bundled into one LLM-facing benchmark | Per-relation accuracy / Set-F1 vs human annotations |
| WHO Says WHAT to WHOM (survey) | 2022 | IJCAI survey framing MPC as WHO / WHAT / WHOM | Survey; consolidates addressee-recognition + response-selection task formulations |
| Multi-Party Conversational Agents: A Survey | 2025 | Taxonomy + metric inventory for group-chat sub-capabilities | Survey; maps datasets (Ubuntu IRC, Molweni) to accuracy/F1/disentanglement metrics |
Synthesis. This is the most mature corner — addressee accuracy and disentanglement clustering metrics (VI / one-to-one / Shen-F) are well-established and automatable. The catch: they grade classification on human-authored transcripts, not whether a generative agent routes its own reply to the right person. The modern LLM result (Inoue et al.: GPT-4o barely beats an 80% majority baseline, below chance on next-speaker) shows the capability is unsolved precisely where it now matters most — inside a generating agent.
2.4 Social appropriateness, role/persona & multi-speaker context
| Resource | Year | What | Eval method |
|---|---|---|---|
| SOTOPIA / SOTOPIA-Eval | 2023 | Goal-driven social role-play between LLM agents with private goals | 7-dim rubric (Believability, Relationship, Knowledge, Secret, Social Rules, Financial, Goal); human + GPT-4 judge, validated against humans |
| DEBATE | 2025 | Whether multi-agent role-play reproduces real human group dynamics; ~29k messages, 697 groups, public + private beliefs | Utterance metrics (semantic sim, stance delta, ROUGE-L) + group opinion-dynamics (convergence, public/private dissociation) + individual partner-influence |
| MPCEval | 2026 | Purpose-built MPC generation benchmark; next-message + full-rollout | Reference-free novel metrics across speaker modeling, content quality, speaker-content consistency (the paper explicitly rejects ROUGE/BLEU/BERTScore/G-Eval, contrary to one finding's claim) |
| PersonaGym / PersonaScore | 2024 | Dynamic persona-agent eval; 200 personas, 10k questions, 150 environments | 5 axes (Expected Action, Action Justification, Linguistic Habits, Persona Consistency, Toxicity Control) scored 1-5 by an LLM-judge ensemble |
| RENOVI | 2024 | 9,258 dialogues annotated with social norms | Sequenced detect → classify → remediate norm violations + LLM-human norm-alignment |
| NormBank (SCENE) | 2023 | 155k role/setting-conditioned social norms | Non-monotonic classification: same behavior labeled expected/permitted/unexpected by role + setting |
| The Social Laboratory | 2025 | Psychometric framework for LLMs as social actors in multi-agent debate | Conformity (shift under group pressure), persuasion, role adherence across rounds |
Synthesis. Social/persona quality is graded almost entirely by LLM-as-judge rubrics (SOTOPIA's 7 dimensions, PersonaScore's 5 axes), and SOTOPIA's GPT-4-judge-vs-human validation is the template the field copies. The frontier shift is from grading single dialogues to grading emergent group dynamics — DEBATE and Social Laboratory measure conformity, partner influence, and public/private belief drift, which dyadic evals structurally cannot see. NormBank's role-conditioning (a line that's fine from one speaker but not another) is the key idea for context-dependent appropriateness, but per-turn speaker-conditioned scoring in a live group is still underdeveloped.
2.5 Simulators & user-simulation harnesses
| Resource | Year | What | Eval method |
|---|---|---|---|
| ProMediate | 2025 | Proactive mediator agent in multi-party negotiation; Easy/Medium/Hard tiers; simulated participants in 3 conflict modes | Consensus Change, Topic Efficiency, Response Latency, Mediator Effectiveness (consensus-slope pre/post intervention), Mediator Intelligence (LLM-judge 1-5) |
| SOTOPIA | 2023 | Procedurally generated social-interaction env (dyadic → multi-party planning) | 7-dim rubric, human + GPT-4 judge; paper notes GPT-4 weaker on Social Rules / Secret dims |
| tau2-bench (τ²-bench) | 2025 | Dual-control tool-agent-user benchmark (both user and agent call tools) | UserSimulator LLM drives turns; reward gated on required tool actions / DB end-state + policy; pass^k reliability over repeated trials |
| RealUserSim | 2026 | Simulators grounded in 7,275 behavioral profiles from 14k+ real WildChat conversations | PT3 fidelity benchmark (style-match 24.2%→45.3% with grounding); agent-eval on TauBench surfaces failures cooperative sims miss; failure modes "Formalism Ceiling" / "Directive Amplification" |
| Mind the Sim2Real Gap | 2026 | Quantifies how faithfully LLM user-simulators replicate humans | User-Sim Index (USI, 0-100, six dims) via Sørensen-Dice + ECE + MAE; validated on 451 humans / 165 τ-bench tasks; sims create "easy mode," binary reward orthogonal to human-perceived quality |
| SAGE | 2025 | Top-down (persona) + bottom-up (knowledge) grounded user simulator | Measured by bug-finding power: surfaces up to 33% more agent errors than generic-user baselines |
| GroupMemBench | 2026 | Multi-party agent memory; graph-grounded synthesis + adversarial asker-bound queries | Accuracy on group dynamics / speaker-grounded belief / audience-adapted language; best 46.0% |
| MUCA + MUS | 2024 | Multi-user agent + simulator modeling real chat-record behavior | Engagement, evenness, consensus vs GPT-4 baseline across decision/problem/discussion tasks |
| MultiAgentBench (MARBLE) | 2025 | Multi-agent suite incl. Werewolf / bargaining | Milestone KPI = n_j/M per agent + LLM-judged Communication/Planning/Coordination (0-5); competition win/loss |
Synthesis. The reusable machinery is simulated participants + state-diff grading: spin up LLM "humans," run a rollout, score the change they produce (consensus delta, leakage rate, milestone attribution) rather than a single turn. The honest warning that runs through this entire sub-topic — and is the single most important methodological takeaway — is that the simulated humans are too nice: RealUserSim and Sim2Real-Gap both show LLM simulators inflate agent success and that binary task reward is orthogonal to human-perceived quality. tau2-bench's pass^k is the standard reliability metric practitioners already reach for. True N-simulated-humans-plus-one-agent harnesses with grading remain rare (ProMediate, MUCA, GroupMemBench).
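The pass^k reliability number mentioned above can be sketched in a few lines. This assumes the commonly cited definition (the probability that k independent trials of the same task all succeed, averaged over tasks) and the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials. The function name and the sample counts are hypothetical, not taken from the tau2-bench codebase.

```python
from math import comb

def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: chance that k trials of a task ALL succeed, averaged over tasks.

    successes_per_task: number of successful rollouts per task, each out of n_trials.
    Uses the estimator C(c, k) / C(n, k); math.comb returns 0 when c < k.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical results: 3 tasks, 4 rollouts each.
successes = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(successes, n_trials=4, k=k), 3))
```

With these made-up counts the score drops as k grows, which is the point of the metric: an agent that succeeds half the time on a task contributes little once every one of k trials must pass.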
3. How group-chat eval actually works (the methods core)
Across all five sub-topics, the same handful of mechanisms recur. A "group-chat agent eval" is some composition of:
Addressee accuracy (the signature metric). Pick the correct interlocutor from prior speakers; scored as classification accuracy, sometimes 4-way (A/B/C/none) against a majority-class baseline (Inoue et al.), sometimes joint with response selection (Ouchi & Tsuboi's ADR + recall@k). It appears verbatim as HSII's r2 stage. This is the cleanest, most automatable group-chat probe and the historical root of the field.
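A minimal sketch of the accuracy-versus-majority-baseline comparison described above. The label set (A/B/C/O) follows the 4-way framing; the gold and predicted labels are invented for illustration and are not data from any of the cited papers.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of turns where the predicted addressee matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def majority_baseline(gold):
    """Accuracy of always guessing the most frequent gold label."""
    label, count = Counter(gold).most_common(1)[0]
    return label, count / len(gold)

# Hypothetical labels: "O" means no specific addressee (addressed to the group).
gold = ["O", "O", "O", "O", "A", "O", "B", "O", "O", "C"]
pred = ["O", "O", "O", "O", "A", "O", "O", "O", "O", "O"]

base_label, base_acc = majority_baseline(gold)
model_acc = accuracy(gold, pred)
print(f"baseline (always {base_label}): {base_acc:.2f}")
print(f"model: {model_acc:.2f}  lift over baseline: {model_acc - base_acc:+.2f}")
```

Reporting the lift rather than raw accuracy is the design choice that matters here: when one label dominates, a high raw accuracy can hide a model that almost never resolves an explicit addressee.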
Response-decision / when-to-speak F1. Binary "is this a valid moment to speak?" graded by precision/recall/F1 against human-labeled transition points (TRP work, TurnGPT, VAP family). The crucial variant — silence as a first-class correct label (respond vs backchannel vs stay-silent) — exists only in spoken/multimodal work (MM-When2Speak); MUCA's "excessive chime-in %" is the text proxy.
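To make "silence as a first-class correct label" concrete, here is a small per-class precision/recall/F1 sketch over a three-way decision. The label names and the toy decisions are assumptions for illustration; they are not the MM-When2Speak label schema or data.

```python
def per_class_prf(gold, pred, labels):
    """Per-class precision, recall, and F1 for a multi-class decision."""
    report = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1)
    return report

LABELS = ["respond", "backchannel", "silent"]

# Hypothetical gold decisions and an over-eager agent that almost always responds.
gold = ["respond", "silent", "silent", "backchannel", "respond", "silent"]
pred = ["respond", "respond", "silent", "respond", "respond", "respond"]

for label, (p, r, f1) in per_class_prf(gold, pred, LABELS).items():
    print(f"{label:12s} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

The over-eager agent gets perfect recall on "respond" while its recall on "silent" collapses, which is exactly the failure a respond-only metric cannot see.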
Conversation disentanglement / structure metrics. Before an agent can act it must untangle interleaved threads: reply-link P/R/F1 on edges, then clustering scored by VI, one-to-one overlap, and Shen-F (irc-disentanglement), plus discourse-dependency F1 (Molweni). This is the prerequisite-skill layer.
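The two levels of disentanglement scoring (edges, then clusters) can be sketched as follows. This is a simplified illustration: links are (child, parent) message-index pairs, threads are the connected components of those links, and the cluster score is just the fraction of gold threads recovered exactly. It does not implement VI, one-to-one, or Shen-F, and the example links are made up.

```python
def link_prf(gold_links, pred_links):
    """Precision, recall, and F1 over reply-to edges given as (child, parent) pairs."""
    gold, pred = set(gold_links), set(pred_links)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def threads(n_messages, links):
    """Group messages into threads (connected components) with a small union-find."""
    parent = list(range(n_messages))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for child, par in links:
        parent[find(child)] = find(par)
    groups = {}
    for m in range(n_messages):
        groups.setdefault(find(m), set()).add(m)
    return {frozenset(g) for g in groups.values()}

def exact_thread_match(n_messages, gold_links, pred_links):
    """Fraction of gold threads that the prediction reproduces exactly."""
    gold_threads = threads(n_messages, gold_links)
    pred_threads = threads(n_messages, pred_links)
    return len(gold_threads & pred_threads) / len(gold_threads)

# Hypothetical 6-message channel with two interleaved threads.
gold_links = [(1, 0), (2, 0), (4, 3), (5, 4)]
pred_links = [(1, 0), (2, 1), (4, 3), (5, 4)]  # message 2 linked to the wrong parent

print(link_prf(gold_links, pred_links))
print(exact_thread_match(6, gold_links, pred_links))
```

In this toy case one edge is wrong but both threads are still recovered, which is why the literature reports edge-level and cluster-level scores separately.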
Simulated-user rollout + state-diff. Stand up LLM participants, run a multi-turn rollout, and grade the delta they produce: consensus change and mediator-effectiveness slope (ProMediate), engagement/evenness/consensus (MUCA), tool-action/DB end-state with pass^k reliability (tau2-bench), milestone attribution (MultiAgentBench). Always paired with a fidelity check, USI (Sim2Real-Gap) or PT3 (RealUserSim), because over-cooperative sims inflate the grade.
LLM-as-judge social rubrics. Multi-dimension 1-5 / Likert rubrics for the qualities that have no exact-match answer: SOTOPIA's 7 dimensions, PersonaScore's 5 axes, ProMediate's Mediator Intelligence. SOTOPIA's GPT-4-judge-validated-against-humans is the calibration template, with the documented caveat that judges are weaker on subtle dimensions (Social Rules, Secret), and judge reliability across many speakers in long transcripts is largely unvalidated.
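One way to picture the judge-validated-against-humans step is a per-dimension agreement check between human and judge scores on the same episodes. The sketch below uses a plain Pearson correlation; the dimension names, the 1-5 scale, and all scores are hypothetical, and this is not SOTOPIA's actual validation code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant rater has no defined correlation
    return cov / sqrt(vx * vy)

def agreement_by_dimension(human, judge):
    """Map each rubric dimension to the human-vs-judge correlation."""
    return {dim: pearson(human[dim], judge[dim]) for dim in human}

# Hypothetical 1-5 scores for five episodes on two rubric dimensions.
human = {"goal": [1, 2, 3, 4, 5], "social_rules": [1, 2, 3, 4, 5]}
judge = {"goal": [1, 2, 3, 5, 5], "social_rules": [3, 3, 4, 3, 3]}

for dim, r in agreement_by_dimension(human, judge).items():
    print(f"{dim:13s} r={r:.2f}")
```

Breaking agreement out by dimension is what exposes the caveat above: a judge can track humans closely on an outcome dimension while being nearly uncorrelated on a subtle one.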
Group-specific safety / leakage. A uniquely multi-party axis: what does the agent reveal to other participants? Measured as contextual-integrity leakage rate (MAGPIE), and as speaker-grounded belief/audience-adaptation accuracy (GroupMemBench).
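A deliberately naive sketch of a leakage-rate computation: the fraction of private items that show up verbatim in anything the agent said to the group. Substring matching is an assumption made for brevity and would miss paraphrased leaks; it is not MAGPIE's detection method, and the chat and secrets are invented.

```python
def leakage_rate(messages, secrets, agent):
    """Fraction of secrets appearing verbatim (case-insensitive) in the agent's messages.

    messages: list of (speaker, text) tuples in channel order.
    secrets:  strings the agent was told to keep private.
    """
    agent_text = " ".join(text.lower() for speaker, text in messages if speaker == agent)
    leaked = [s for s in secrets if s.lower() in agent_text]
    rate = len(leaked) / len(secrets) if secrets else 0.0
    return rate, leaked

# Hypothetical channel where "bot" holds two private facts.
chat = [
    ("ana", "What budget are we working with?"),
    ("bot", "We can go up to 40k, though the board cap is confidential."),
    ("raj", "And the launch date?"),
    ("bot", "I can't share that yet."),
]
secrets = ["40k", "March 3"]

rate, leaked = leakage_rate(chat, secrets, agent="bot")
print(rate, leaked)
```

A real grader would pair this with a task-completion check, since an agent that says nothing leaks nothing but also fails the task.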
The recurring shape: a cheap, automatable classification gate (addressee, when-to-speak, disentanglement) + an expensive judge/rollout layer (social rubric, state-diff) + a fidelity check on the simulated humans.
4. Open gaps (the genuine frontier)
- No live, human-in-the-loop, embedded-agent benchmark. Everything is agent-vs-agent or static-transcript classification. An LLM as one participant in a sustained real channel with interruptions and side-threads is unmeasured.
- Silence-as-correct in text is unbenchmarked. Text MPC benchmarks assume the agent always replies; there is no adopted metric for over-eager interjection / false-positive responding in text group chat.
- Most rigorous timing and addressee work is dyadic or triadic. Genuine N>3 dynamics (overlapping threads, shifting floor) are barely covered, and Triadic VAP shows dyadic models degrade as parties grow, so existing metrics may not transfer.
- No unified scorecard. Addressee accuracy, evenness, leakage rate, disentanglement VI, and milestone KPIs are siloed per paper. There is no joint metric scoring whether + when + to whom + what in one pass, and no "group-chat competence" leaderboard.
- Speaker-grounded memory / belief tracking is barely off the ground. GroupMemBench (2026) is the first standardized probe; best systems sit at 46%.
- Simulator fidelity is unsolved and under-reported. Sims are "easy mode"; almost no group-chat eval reports a USI/PT3-style fidelity score alongside agent grades, and persona/decoding configs that move scores are rarely pinned or released.
- Judge validity at multi-party scale is unvalidated. LLM-judge bias is documented dyadically (SOTOPIA); no calibrated, speaker-attributing judge or inter-rater protocol exists for long multi-party transcripts.
- Multi-party safety beyond privacy is nearly absent — colluding participants, cross-user prompt injection, moderation of harmful cross-user dynamics.
- Real-platform grounding is thin. Most agent-era data is synthetic/GPT-generated; few use real Slack/Discord/Teams traces, so production ecological validity is unverified.
5. Relevance to our demo
Honest take: yes, there's a demoable toy — a narrow, sharp one — but it's a half-chapter extension, not a new module. The strongest single asset is that the most distinctive group-chat skill (addressee resolution) is cleanly automatable and demonstrably broken in frontier models — which is exactly the kind of "watch it fail with your own input" insight the curriculum is built around.
Concrete demoable toy: an "Is this for me, and should I speak?" gate.
- Hand the agent a short 3-4-person transcript (synthetic, hardcoded in data/ like our other toy sets) and ask it to (a) pick the addressee of the last message and (b) decide respond / react / stay-silent.
- Grade addressee against a gold label as accuracy vs a majority-class baseline (the Inoue et al. framing); this viscerally shows "the model barely beats always-guessing." That's a great teaching beat.
- This is a near-exact transplant of our existing ch02 (code grader) exact-match/classification pattern, just over a multi-party transcript instead of a math answer.
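A rough sketch of what the gate could look like end to end. Everything here is hypothetical: the `Example` shape, the four hardcoded transcripts, and `mention_agent`, a rule-based stand-in that occupies the slot where a real model call would go. The over-eager rate (how often the agent spoke when the gold action was stay-silent) is our own illustrative metric, not one adopted from the surveyed papers.

```python
from dataclasses import dataclass

@dataclass
class Example:
    transcript: list      # list of (speaker, text) tuples, oldest first
    gold_addressee: str   # who the last message is for
    gold_action: str      # "respond", "react", or "stay-silent"

TOY_SET = [
    Example([("ana", "anyone know why CI is red?"),
             ("raj", "@bot can you check the build log?")], "bot", "respond"),
    Example([("ana", "@raj lunch at noon?"),
             ("raj", "@ana sure")], "ana", "stay-silent"),
    Example([("bot", "The deploy finished."),
             ("ana", "thanks!")], "bot", "react"),
    Example([("raj", "@ana did you merge it?"),
             ("ana", "yep, merged this morning")], "raj", "stay-silent"),
]

def mention_agent(transcript, me="bot"):
    """Stand-in for a model: trust explicit @mentions, otherwise assume the group."""
    _, text = transcript[-1]
    mentions = [w.strip("@,.?!") for w in text.split() if w.startswith("@")]
    addressee = mentions[0] if mentions else "everyone"
    action = "respond" if addressee in (me, "everyone") else "stay-silent"
    return addressee, action

def grade(examples, agent):
    """Score addressee accuracy, action accuracy, and the over-eager rate."""
    outputs = [agent(ex.transcript) for ex in examples]
    n = len(examples)
    addr_acc = sum(a == ex.gold_addressee for (a, _), ex in zip(outputs, examples)) / n
    act_acc = sum(act == ex.gold_action for (_, act), ex in zip(outputs, examples)) / n
    on_silent = [act for (_, act), ex in zip(outputs, examples)
                 if ex.gold_action == "stay-silent"]
    over_eager = (sum(act != "stay-silent" for act in on_silent) / len(on_silent)
                  if on_silent else 0.0)
    return {"addressee_acc": addr_acc, "action_acc": act_acc, "over_eager": over_eager}

print(grade(TOY_SET, mention_agent))
```

The stand-in gets both explicit-mention cases right and both implicit-addressee cases wrong, so even this tiny set separates "reads @mentions" from "tracks who is talking to whom".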
How it extends specific existing chapters:
- ch06 (trajectory grader) — the natural host. A trajectory through a group chat adds a per-step addressee and speak/stay-silent decision to grade alongside tool calls. DICE-Bench's "params scattered across speakers" and ProMediate's intervention-latency are the conceptual upgrades; we'd toy-ify them, not import them.
- ch05 (LLM as judge) — directly reusable. SOTOPIA's 7-dimension rubric and the GPT-4-judge-validated-against-human pattern are exactly ch05's mechanism; a "social appropriateness" rubric (was the chime-in warranted? was it addressed to the right person?) drops into our existing judge with a different rubric, reusing ch04's rubric-design widget.
- ch03 (human as judge) — the calibration story carries over: MUCA's "56% of users said it chimed in too much" is the human-scored restraint signal, and our ch07 (graders disagree) calibration view already compares human vs judge.
What not to oversell: the full research frontier (live human channels, N>3 floor management, simulator fidelity, speaker-grounded memory) is heavy and unsettled, and not toy-able in under three minutes. The simulator-fidelity caution (Sim2Real-Gap: "your fake users are too nice") is a genuinely good concept slide but not a runnable demo. Recommend: a single half-chapter ("Who is this for, and should I answer?") built on addressee-accuracy-vs-baseline + a social-rubric judge, explicitly reusing ch02/ch05/ch03 machinery. That's real, honest, and finishes in three minutes; anything larger is a research project, not a mentorship demo.