This page consolidates every named resource cited across the two reports. The eval-mechanics sources are benchmarks, datasets, frameworks, and papers (from the Eval Survey); the conversation-science sources are the foundational theory (from What to Eval For).
Verification column: ✓ independently confirmed real / correctly attributed during the research run · ⚠ flagged inaccurate or self-corrected (see the report's inline note) · — not verifiable. Raw verified data: eval-survey · dynamics.
Eval-mechanics sources (39)
Benchmarks, datasets, frameworks, and papers for measuring group-chat agents.
| Resource | Kind | Year | Verification | What it is |
|---|---|---|---|---|
| Addressee and Response Selection for Multi-Party Conversation (Ouchi & Tsuboi) | dataset | 2016 | ✓ | Foundational EMNLP 2016 paper that formalized the joint task of selecting BOTH whom an agent addresses and what it says in a multi-party conversation. Released a large… |
| Addressee and Response Selection for Multi-Party Conversation (Ubuntu IRC benchmark) | benchmark | 2016 | ✓ | The foundational EMNLP 2016 paper (Ouchi & Tsuboi) that defines the joint Addressee-and-Response-Selection (ARS) task on the Ubuntu Multiparty Conversation Corpus: given… |
| Addressee and Response Selection (Ubuntu IRC / Hu et al.) | dataset | 2018 | ✓ | Canonical pre-LLM multi-party task and dataset built from Ubuntu IRC chat logs, where speakers play sender/addressee/observer roles. The system must pick both the correct… |
| A Large-Scale Corpus for Conversation Disentanglement (Kummerfeld et al.) / irc-disentanglement | dataset | 2019 | ✓ | ACL 2019 release of 77,563 #Ubuntu/#Linux IRC messages manually annotated with reply-to (parent-child) links forming reply-structure graphs that both disentangle interleaved… |
| Who Is Speaking to Whom? W2W model (Le, Hu et al.) | paper | 2019 | ✓ | EMNLP-IJCNLP 2019 paper introducing the who-to-whom (W2W) model that identifies the addressee of EVERY utterance in a session jointly, not just the next response. Uses… |
| Molweni | dataset | 2020 | ✓ | A multi-party dialogue machine-reading-comprehension dataset sampled from the Ubuntu Chat Corpus: 9,754 dialogues, 86,042 utterances, 30,066 question-answer pairs, annotated… |
| Molweni | dataset | 2020 | ✓ | COLING 2020 machine-reading-comprehension dataset over multiparty dialogue, sampled from the Ubuntu Chat Corpus: 10,000 dialogs / 88,303 utterances, 30,066 questions (incl.… |
| TurnGPT | paper | 2020 | ✓ | A GPT-2-based language model (Ekstedt & Skantze) that predicts turn-shifts by adding Transition Relevance Place (TRP) tokens to the vocabulary, projecting turn completion from… |
| WHO Says WHAT to WHOM: A Survey of Multi-Party Conversations | paper | 2022 | ✓ | IJCAI 2022 survey framing multi-party conversation research around the three coupled questions of WHO (speaker), WHAT (utterance), and to WHOM (addressee), surveying tasks,… |
| NormBank (SCENE taxonomy) | dataset | 2023 | ✓ | A knowledge bank of 155k situational social norms (Ziems et al., ACL 2023) where each norm is grounded in a multivalent sociocultural frame — setting, agent roles, attributes,… |
| SOTOPIA / SOTOPIA-Eval | benchmark | 2023 | ✓ | An open-ended environment that simulates goal-driven social interactions between LLM agents who role-play diverse character profiles with private goals and relationship… |
| Large Language Models Know What To Say But Not When To Speak (TRP benchmark) | paper | 2024 | ✓ | An EMNLP 2024 Findings paper (Umair, Sarathy, de Ruiter) that introduces a dataset of participant-labeled within-turn Transition Relevance Places (TRPs) in unscripted spoken… |
| Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction | paper | 2024 | ✓ | Combines an LLM (Llama, semantic/syntactic context) with a Voice Activity Projection model (acoustic cues) via an LSTM ensemble to predict turn-taking opportunities, fusing… |
| MUCA + MUS (Multi-User Chat Assistant / Multi-User Simulator) | framework | 2024 | ✓ | Described as the first LLM framework dedicated to multi-user group conversations, organized around the 3W design (What to say, When to respond, Who to answer) via a Sub-topic… |
| MUCA: Multi-User Chat Assistant framework | framework | 2024 | ✓ | An LLM framework for facilitating group text conversations whose Utterance Strategies Arbitrator explicitly decides the What/When/Who of a bot utterance, using an 'in-context… |
| PersonaGym / PersonaScore | benchmark | 2024 | ✓ | Described as the first dynamic evaluation framework for persona agents (200 personas, 10k questions, 150 environments), with PersonaScore as an automated, human-aligned metric grounded in… |
| RENOVI: Remediating Norm Violations in Socio-Cultural Conversations | benchmark | 2024 | ✓ | A large-scale corpus of 9,258 multi-turn dialogues (512 human-authored + 8,746 ChatGPT-synthesized) annotated with social norms, designed to evaluate detecting and remediating… |
| Addressee Recognition in Multi-modal Multi-party Dialogue (LLM benchmark) | benchmark | 2025 | ✓ | A benchmark built on a multi-modal corpus of triadic (3-participant) discussions that tests whether an LLM can identify the addressee — who is being spoken to / who should take… |
| An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue | benchmark | 2025 | ✓ | A benchmark built on a multimodal triadic (3-party) dialogue corpus with addressee annotations (explicit addressees occur in ~20% of turns), testing whether LLMs can identify… |
| An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue (Inoue et al., TEIDAN) | benchmark | 2025 | ✓ | Kyoto University benchmark testing whether modern LLMs (GPT-4o) can do addressee recognition and next-speaker prediction in spontaneous triadic (3-person) dialogue, using the… |
| Beyond Words: Multimodal LLM Knows When to Speak | paper | 2025 | ✓ | Builds a dataset annotated for turn-taking labels, backchannel signals (e.g. 'mm-hmm'), and speech timing from Fisher, MAHNOB-HCI, and Harper Valley Bank corpora, training a… |
| DEBATE: Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates | benchmark | 2025 | ✓ | A large-scale benchmark (30,707 messages, 2,832 U.S. participants, 708 groups, 107 topics) that measures whether multi-agent role-playing LLMs reproduce authentic human group… |
| DICE-Bench | benchmark | 2025 | ✓ | Dialogue-based Interactive Calling Evaluation Benchmark: the first multi-round, multi-party benchmark for function/tool-calling grounded in realistic group-chat data. 1,607… |
| HSII (How Social Is It?) | benchmark | 2025 | ✓ | A benchmark explicitly built to assess LLMs as autonomous social agents in multi-user, multi-turn settings (avg 6.72 participants, 7.8 turns per scenario), as opposed to… |
| MAGPIE | benchmark | 2025 | ✓ | Multi-AGent contextual PrIvacy Evaluation: ~200 high-stakes tasks (earlier version: 158 scenarios across 15 domains) evaluating privacy preservation in multi-agent,… |
| Multi-Party Conversational Agents: A Survey | paper | 2025 | ✓ | A survey organizing multi-party conversational-agent research, with explicit sub-sections on Turn Detection (when to speak) and Addressee Selection (whom to address),… |
| Multi-Party Conversational Agents: A Survey | paper | 2025 | ✓ | A 2025 survey of multi-party conversational agents that organizes the field around the sub-capabilities required for group settings, including who-speaks-next / turn-taking,… |
| MultiAgentBench | benchmark | 2025 | ✓ | A benchmark evaluating LLM-based multi-agent systems across interactive scenarios with both cooperative (mutual-goal) and competitive (conflicting-goal) settings, supporting… |
| MultiAgentBench (MARBLE) | benchmark | 2025 | ✓ | A benchmark suite (Zhu et al.) for LLM multi-agent systems across cooperative (research collaboration, Minecraft build, database diagnosis, coding) and competitive (bargaining, Werewolf… |
| Multimodal Conversation Structure Understanding (MCSU) | benchmark | 2025 | ✓ | A 2025 benchmark for evaluating (multimodal) LLMs on the structural fabric of multi-party conversation — including speaker/addressee and reply-to relations — beyond surface… |
| ProMediate | framework | 2025 | ✓ | A socio-cognitive framework (USC + Microsoft) for evaluating proactive AI mediator agents in multi-topic, multi-party negotiations. Includes a simulation testbed with… |
| SAGE | framework | 2025 | ✓ | A top-down/bottom-up knowledge-grounded user simulator for multi-turn agent evaluation (Columbia DAPLab, Findings of EACL 2026). Grounds simulated users in business logic… |
| tau2-bench (τ²-bench) | benchmark | 2025 | ✓ | Sierra's benchmark for tool-agent-user interaction. τ²-bench extends τ-bench to a dual-control setting (Telecom domain) where BOTH the simulated user and the agent can call… |
| The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation | framework | 2025 | ✓ | A psychometric evaluation framework (Zarreen Reza) that assesses LLMs as social actors inside multi-agent debates rather than in isolation, using a 3-round multi-party debate… |
| Triadic Multi-party Voice Activity Projection (VAP) for Turn-taking | paper | 2025 | ✓ | First extension of Voice Activity Projection to triadic (3-party) spoken conversation, predicting each speaker's future voice activity from acoustics to determine who takes the… |
| GroupMemBench | benchmark | 2026 | ✓ | A benchmark for LLM agent MEMORY specifically in multi-party conversations, motivated by the fact that nearly all memory benchmarks assume a dyadic single-user setup while real… |
| Mind the Sim2Real Gap in User Simulation for Agentic Tasks | paper | 2026 | ✓ | A CMU LTI study quantifying how faithfully LLM user simulators replicate real human behavior in agent interactions, and how that mismatch distorts benchmark scores. |
| MPCEval: A Benchmark for Multi-Party Conversation Generation | benchmark | 2026 | ✓ | A standardized, task-aware framework for evaluating multi-party conversation generation, covering both next-message prediction and full-conversation generation across varied… |
| RealUserSim | framework | 2026 | ✓ | A user-simulation framework grounded in real behavioral data: extracts 7,275 executable behavioral profiles from 14,000+ authentic human-LLM conversations (WildChat) and uses… |
Conversation-science sources (25)
Foundational constructs from conversation analysis, the social psychology of groups, and sociolinguistics/pragmatics that ground the eval criteria.
| Construct | Seminal source | Discipline | Verification |
|---|---|---|---|
| Adjacency pairs & conditional relevance | Schegloff & Sacks 1973, "Opening up closings", Semiotica 8:289–327; elaborated in Schegloff 2007, "Sequence Organization in Interaction" | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |
| Audience Design & Style-Shifting | Bell 1984, "Language Style as Audience Design" (Language in Society 13:145–204) | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Audience effects / social facilitation & evaluation apprehension | Zajonc 1965, "Social Facilitation" (Science) — mere-presence drive theory; Cottrell 1972 evaluation-apprehension refinement | Social psychology of group conversation and small-group dynamics | ✓ |
| Common ground & grounding (least collaborative effort, grounding criterion) | Clark & Brennan 1991, "Grounding in Communication" (in Resnick et al., eds., Perspectives on Socially Shared Cognition); building on Clark & Wilkes-Gibbs 1986 | Social psychology of group conversation and small-group dynamics | ✓ |
| Communication Accommodation Theory (convergence / divergence) | Giles 1973; Giles, Coupland & Coupland 1991 (Contexts of Accommodation); Giles & Ogay 2007 review | Social psychology of group conversation and small-group dynamics | ✓ |
| Contextualization Cues, Code-Switching & Register | Gumperz 1982, "Discourse Strategies" (conversational code-switching, contextualization cues) | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Cooperative Principle & Conversational Maxims (Implicature) | Grice 1975, "Logic and Conversation" (in Syntax and Semantics 3) | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Cultural Variation in Face & Politeness (Discernment / Wakimae) | Matsumoto 1988 and Ide 1989 (critiques of Brown & Levinson using Japanese); concept of wakimae/discernment politeness | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Face & Face-Threatening Acts (FTAs) | Brown & Levinson 1987, "Politeness: Some Universals in Language Usage" (building on Goffman); also Brown & Levinson 1978 | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Face-work, Deference & Demeanor | Goffman 1955, "On Face-Work" (Psychiatry 18:213–231); Goffman 1956/1967, "The Nature of Deference and Demeanor" | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Floor management with 3+ parties (selection, schisming, addressee vs. next-speaker) | Sacks, Schegloff & Jefferson 1974 (next-speaker selection); Egbert 1997 (schisming); Auer 2018 / Lerner 2003 (gaze, addressing, next-speaker selection in 3-party talk) | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |
| Floor-control & participation inequality (turn-taking, dominance, silencing) | Sacks, Schegloff & Jefferson 1974 (turn-taking systematics); Edelsky 1981 "Who's got the floor?" (singly-developed F1 vs. collaborative F2 floor); conversational-dominance literature | Social psychology of group conversation and small-group dynamics | ✓ |
| Group polarization | Moscovici & Zavalloni 1969, "The group as a polarizer of attitudes" (J. Personality & Social Psychology); related risky-shift work, Stoner 1961 | Social psychology of group conversation and small-group dynamics | ✓ |
| Indirect Speech Acts | Searle 1975, "Indirect Speech Acts" (in Syntax and Semantics 3); Searle 1969, "Speech Acts" | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Informational vs. normative social influence (conformity) | Deutsch & Gerard 1955, "A Study of Normative and Informational Social Influences upon Individual Judgment" (J. Abnormal & Social Psychology); rooted in Asch 1951/1956 conformity line experiments | Social psychology of group conversation and small-group dynamics | ✓ |
| Overlap & overlap-resolution | Schegloff 2000, "Overlapping talk and the organization of turn-taking for conversation", Language in Society 29(1):1–63 (extends Sacks/Schegloff/Jefferson 1974) | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |
| Participation Framework & Footing | Goffman 1981, "Footing" (in Forms of Talk) | Sociolinguistics & Pragmatics (multi-party conversation) | ✓ |
| Participation framework & footing (Goffman) | Goffman 1981, "Footing", in Forms of Talk, University of Pennsylvania Press; extended by C. Goodwin & M. H. Goodwin 2004, "Participation" | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |
| Recipient design (audience design) | Sacks & Schegloff (recipient design, e.g., Sacks 1992 Lectures; Sacks & Schegloff 1979 on reference); cf. H. H. Clark & Murphy 1982 "audience design"; Giles' accommodation | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |
| Repair (self/other-initiation, self/other-repair) | Schegloff, Jefferson & Sacks 1977, "The Preference for Self-Correction in the Organization of Repair in Conversation", Language 53:361–382 | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |
| Sequence organization & expansion (pre-/insert-/post-) | Schegloff 2007, "Sequence Organization in Interaction: A Primer in Conversation Analysis, Vol. 1", Cambridge University Press | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |
| Social loafing / free-riding (and effort responsibility) | Latané, Williams & Harkins 1979, "Many hands make light the work: The causes and consequences of social loafing" (J. Personality & Social Psychology); Karau & Williams 1993 meta-analysis | Social psychology of group conversation and small-group dynamics | ✓ |
| Status hierarchies & expectation states in talk | Berger, Cohen & Zelditch 1972 and Berger, Fisek, Norman & Zelditch 1977 (Expectation States / Status Characteristics Theory); Ridgeway & Berger syntheses | Social psychology of group conversation and small-group dynamics | ✓ |
| Theory of mind & perspective-taking in groups (egocentric anchoring) | Keysar, Barr, Balin & Brauner 2000, "Taking Perspective in Conversation" (Psychological Science) — egocentric anchoring & adjustment; Premack & Woodruff 1978 on theory of mind | Social psychology of group conversation and small-group dynamics | ✓ |
| Turn-taking systematics (TRPs & turn-allocation) | Sacks, Schegloff & Jefferson 1974, "A Simplest Systematics for the Organization of Turn-Taking for Conversation", Language 50:696–735 | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) | ✓ |