[
  {
    "subtopic": "Benchmarks and datasets for evaluating LLM agents in group-chat / multi-party (3+ participant) conversation settings",
    "gaps": [
      "Almost no benchmark evaluates an LLM agent embedded as ONE participant in a human group chat over a sustained session with realistic interruptions, side-conversations, and overlapping threads; most are either agent-to-agent (MultiAgentBench, MAGPIE) or static-dialogue MRC (Molweni). End-to-end live group-chat agent eval with humans-in-the-loop is largely missing.",
      "'When to respond' / floor-management (deciding whether to speak at all, when to stay silent, interruption timing) is under-measured \u2014 MUCA introduces engagement/evenness metrics but there is no widely adopted benchmark with a principled silence/turn-taking ground truth.",
      "Speaker attribution and belief tracking is only just emerging (GroupMemBench, 2026); there is no mature, standardized eval for whether an agent correctly maintains per-speaker mental models across long multi-user channels.",
      "Group-chat-specific social dynamics \u2014 coalition formation, dominance/participation imbalance, side-channeling, sarcasm/in-group reference \u2014 lack quantitative evaluation harnesses; HSII gestures at sociological leveling but metrics remain coarse.",
      "Real-platform grounding is thin: most agent-era benchmarks use synthetic/GPT-generated dialogues; few use real Slack/Discord/Teams traces (privacy + access constraints), so ecological validity for production deployment is unverified.",
      "Evaluation of multi-party safety beyond privacy (e.g., an agent being manipulated by colluding participants, prompt injection via one user aimed at another, or moderation of harmful cross-user dynamics) is nearly absent.",
      "No standard metric reconciliation: addressee-selection accuracy (ADR), engagement/evenness, leakage rate, and milestone KPIs are siloed per paper \u2014 there is no unified scorecard or leaderboard for 'group-chat agent competence' the way there is for single-turn tool use."
    ],
    "findings": [
      {
        "name": "HSII (How Social Is It?)",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2505.04628",
        "whatItIs": "A benchmark explicitly built to assess LLMs as autonomous social agents in multi-user, multi-turn settings (avg 6.72 participants, 7.8 turns per scenario), as opposed to one-on-one assistants. Includes the HSII-Dataset (8,305 samples) derived from real news articles via GPT-4 scenario extraction plus human refinement, and a sociology-grounded agent-task-leveling framework.",
        "evalMethod": "Cascading 4-stage scoring: (1) Format Parsing rate r1, (2) Target Selection rate r2 (picking the correct conversational partner \u2014 i.e. addressee selection), (3) target-switching first-utterance quality r3, (4) sustained multi-turn stability r4. Combined into a single HSII score \u03b9=r1(1+\u03b1\u00b7r2(1+\u03b2(r3+\u03b3\u00b7r4))) so a parse failure zeroes out everything downstream. Also defines a COT-complexity metric (avg chain-of-thought iterations to reach 70% target-selection accuracy). Reports GPT-4 1.399 vs human 2.149.",
        "relevance": "One of the only benchmarks that directly measures the distinctive group-chat skills: deciding WHO to address and WHEN to switch targets among many participants, plus staying coherent across a multi-party thread. Maps almost 1:1 to the 'who to answer / when to respond' problem in a Slack/Discord channel.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2505.04628",
          "note": "Confirmed real (Wu, Xiong, Deng 2025). Dataset 8,305 samples, avg 6.722 participants / 7.801 turns, 4-stage formula \u03b9=r1(1+\u03b1r2(1+\u03b2(r3+\u03b3r4))) all verified. Minor: the 1.399 vs 2.149 figures are overall HSII scores, not the COT-complexity metric as implied."
        }
      },
      {
        "name": "DICE-Bench",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2506.22853",
        "whatItIs": "Dialogue-based Interactive Calling Evaluation Benchmark: the first multi-round, multi-party benchmark for function/tool-calling grounded in realistic group-chat data. 1,607 instances over up to 4 rounds with 2-4 participants, built from a 124-tool / 270-dependency tool graph plus persona-driven multi-agent dialogue generation.",
        "evalMethod": "Three-stage validation pipeline: G-Eval LLM scoring (coherence/consistency/fluency), rule-based filtering, then human criteria filtering. Models are scored with Exact Match (EM) on the function call and a novel DICE-Score that quantifies how fragmented/dispersed the tool-relevant parameters are across turns and speakers (higher = harder). Validated by both rule-based and human evaluation.",
        "relevance": "Captures the core group-chat hardness that single-utterance tool benchmarks miss: the parameters an agent needs are scattered across multiple turns AND multiple speakers, so the agent must track who said what before it can act. Directly models an agent acting inside a multi-person channel.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2506.22853",
          "note": "Confirmed real (ACL 2025 Findings, arXiv 2506.22853). Eval-method description accurate: 3-stage pipeline (G-Eval coherence/consistency/fluency, rule-based, human criteria filtering), Exact Match + DICE-Score dispersion metric, 1,607 instances, up to 4 rounds, 2-4 participants, 124-node/270-edge tool graph."
        }
      },
      {
        "name": "GroupMemBench",
        "kind": "benchmark",
        "year": "2026",
        "url": "https://arxiv.org/abs/2605.14498",
        "whatItIs": "A benchmark for LLM agent MEMORY specifically in multi-party conversations, motivated by the fact that nearly all memory benchmarks assume a dyadic single-user setup while real deployments are groups/channels with multiple users talking to the agent and each other.",
        "evalMethod": "Probes three group-specific memory properties: (i) group dynamics beyond concatenated 1:1 chats, (ii) speaker-grounded belief tracking (per-user/per-speaker memory modeling \u2014 who believes/said what), and (iii) audience-adapted language requiring Theory-of-Mind to produce role-specific vocabulary. Grades whether the agent extracts/retrieves/applies the right speaker-attributed information.",
        "relevance": "The clearest statement of the speaker-attribution and per-participant belief-tracking problem that defines group-chat eval. Useful for evaluating whether an agent in a channel correctly remembers who said what and adapts to its audience, rather than blending everyone into one user.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2605.14498",
          "note": "Confirmed real on arXiv (2605.14498); title, multi-party-vs-dyadic motivation, and all three memory properties (group dynamics, speaker-grounded belief tracking, audience-adapted ToM language) match the abstract exactly."
        }
      },
      {
        "name": "MUCA + MUS (Multi-User Chat Assistant / Multi-User Simulator)",
        "kind": "framework",
        "year": "2024",
        "url": "https://arxiv.org/abs/2401.04883",
        "whatItIs": "Described as the first LLM framework dedicated to multi-user group conversations, organized around the 3W design (What to say, When to respond, Who to answer) via a Sub-topic Generator, Dialog Analyzer, and Conversational Strategies Arbitrator. Ships an LLM-based Multi-User Simulator (MUS) to simulate group participants for optimization/evaluation.",
        "evalMethod": "Quantitative metrics over group conversations: user engagement, conversation evenness (participation balance across users), and opinion consensus; plus subjective user ratings of efficiency, conciseness, and usefulness. MUS enables simulated multi-party conversations to measure the bot's timing/content/addressing decisions.",
        "relevance": "Provides both an eval harness (the MUS simulator) and group-specific metrics (evenness, engagement, consensus) that are otherwise absent from dyadic eval. The 'When/Who to respond' framing and the simulator are directly reusable for evaluating an agent dropped into a group chat.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2401.04883",
          "note": "Confirmed: MUCA (Sub-topic Generator, Dialog Analyzer, Conversational Strategies Arbitrator) + 3W design + MUS simulator all match the arXiv paper; eval description (engagement/evenness/consensus + subjective ratings via simulated group convos) is roughly accurate."
        }
      },
      {
        "name": "MAGPIE",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2510.15186",
        "whatItIs": "Multi-AGent contextual PrIvacy Evaluation: ~200 high-stakes tasks (earlier version: 158 scenarios across 15 domains) evaluating privacy preservation in multi-agent, collaborative, non-adversarial conversations where private info is essential to solving the task, so it cannot be trivially withheld.",
        "evalMethod": "Measures contextual-integrity privacy leakage during multi-agent collaboration: the percentage of sensitive information an agent leaks to other agents while still completing the shared task. Reports leakage rates (e.g., Gemini 2.5-Pro up to 50.7%, GPT-5 up to 35.1%) even when explicitly told not to share.",
        "relevance": "Evaluates a uniquely multi-party failure mode \u2014 what an agent reveals to OTHER participants in a shared channel. Information-disclosure control across participants is a group-chat-specific safety axis that single-user evals cannot surface.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2510.15186",
          "note": "Confirmed real: arXiv 2510.15186 \"MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation\" (200 high-stakes tasks). Eval-method description accurate \u2014 measures privacy leakage in multi-agent collaborative non-adversarial tasks where private info is essential; reports Gemini 2.5-Pro up to 50.7% and GPT-5 up to 35.1% leakage despite instructions not to share."
        }
      },
      {
        "name": "MultiAgentBench",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://aclanthology.org/2025.acl-long.421/",
        "whatItIs": "A benchmark evaluating LLM-based multi-agent systems across interactive scenarios with both cooperative (mutual-goal) and competitive (conflicting-goal) settings, supporting star/tree/graph/chain coordination topologies.",
        "evalMethod": "Milestone-based key performance indicators that score not just task completion but the QUALITY of collaboration and competition, plus emergent multi-agent interaction behavior, evaluated across the different coordination protocols.",
        "relevance": "Although agent-to-agent rather than human-in-channel, it is one of the few benchmarks measuring interaction quality among 3+ participants and how communication topology affects outcomes \u2014 directly transferable to evaluating coordination in a shared multi-party channel.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://aclanthology.org/2025.acl-long.421/",
          "note": "Confirmed real (ACL 2025 Main, arXiv 2503.01935). Eval-method description accurate: milestone-based KPIs scoring collaboration/competition quality, cooperative+competitive settings, and star/tree/graph/chain coordination topologies all verified."
        }
      },
      {
        "name": "Addressee and Response Selection (Ubuntu IRC / Hu et al.)",
        "kind": "dataset",
        "year": "2018",
        "url": "https://arxiv.org/abs/1709.04005",
        "whatItIs": "Canonical pre-LLM multi-party task and dataset built from Ubuntu IRC chat logs, where speakers play sender/addressee/observer roles. The system must pick both the correct addressee and the correct response from candidates given the multi-party context (Zhang, Lee, Polymenakos, Radev; AAAI 2018, building on Ouchi & Tsuboi 2016).",
        "evalMethod": "Two classification accuracies: addressee selection (ADR) accuracy \u2014 choosing the right interlocutor to respond to \u2014 and response selection (RES) accuracy \u2014 choosing the correct reply from a fixed candidate set, evaluated at varying numbers of participants and history lengths.",
        "relevance": "The foundational formalization of the 'who is this aimed at / who do I reply to' problem unique to group chat. The addressee-recognition metric is the historical root of modern group-chat target-selection evals (e.g., HSII's r2 stage) and remains a clean, automatable group-chat probe.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/1709.04005",
          "note": "Confirmed: Zhang, Lee, Polymenakos, Radev \"...Speaker Interaction RNNs\" (AAAI 2018, arXiv 1709.04005); Ubuntu IRC multi-party data, ADR + RES accuracy eval across participant counts/history lengths \u2014 all accurate."
        }
      },
      {
        "name": "Molweni",
        "kind": "dataset",
        "year": "2020",
        "url": "https://arxiv.org/pdf/2004.05080",
        "whatItIs": "A multi-party dialogue machine-reading-comprehension dataset sampled from the Ubuntu Chat Corpus: 9,754 dialogues, 86,042 utterances, 30,066 question-answer pairs, annotated with discourse structure (reply-to links and edge types) over multi-party threads.",
        "evalMethod": "Evaluated as MRC: given a multi-party dialogue, models answer questions (including unanswerable ones) \u2014 scored with standard span-extraction F1/Exact Match \u2014 and separately on discourse-parsing of the reply-to dependency structure (who-replies-to-whom links and relation types).",
        "relevance": "A canonical, fully annotated corpus of tangled multi-party threads with explicit reply-to / discourse graphs \u2014 the ground truth needed to test whether an agent correctly untangles interleaved conversation threads and attributes utterances, a prerequisite skill for any group-chat agent.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2004.05080",
          "note": "Real (COLING 2020). Eval method accurate: span MRC F1/EM with unanswerable Qs + separate discourse dependency parsing. QA pairs (30,066) match; abstract cites ~10,000 dialogues/88,303 utterances vs claimed 9,754/86,042 (released-split counts) \u2014 minor."
        }
      }
    ]
  },
  {
    "subtopic": "Evaluating turn-taking and when-to-speak / response-timing decisions for conversational agents in multi-party settings",
    "gaps": [
      "No standard benchmark grades 'silence as the correct action' in TEXT group chat. The silence-as-correct work (Beyond Words, RESPOND-style) is acoustic/spoken; text multi-party benchmarks (Ubuntu ARS) assume the agent always produces a response and only ask which response \u2014 they cannot score correctly choosing NOT to speak. There is no widely adopted text-group-chat metric for over-eager responding / false-positive interjections.",
      "Turn-taking and addressee work is overwhelmingly dyadic or, at best, triadic. Genuine N>3 group-chat dynamics (overlapping threads, sub-conversations, shifting floor) are barely covered, and the triadic VAP result shows dyadic models degrade as parties increase \u2014 so existing metrics may not transfer to real group chats.",
      "Metrics are fragmented and capability-siloed: F1 for turn detection (CCPE), accuracy for addressee selection (Ubuntu IRC), IoU/BLEU for backchannel timing/content. There is no unified, joint metric that scores an agent's combined decision of WHETHER to speak + WHEN + TO WHOM + WHAT in one pass, which is what a deployed group-chat agent actually does.",
      "Evaluation is largely offline/classification-based against fixed human-labeled turn points, which ignores that in live multi-party chat there is no single ground-truth correct moment \u2014 multiple silences/responses can be acceptable. Few works evaluate timing decisions in a closed-loop, interactive setting where the agent's own (non-)response changes subsequent context.",
      "LLM baselines are strikingly weak and under-diagnosed: GPT-4o is near chance on multi-party addressee/next-speaker recognition (2501.16643) and LLMs are near-random on TRP timing (2410.16044), but there is little analysis of WHY (context-window handling of interleaved speakers, lack of timing signal in text, prompt format) or how to repair it. No agent-eval harness isolates the when-to-speak gate as a gradable component.",
      "Backchannel / minimal-acknowledgment behavior (the 'mm-hmm' equivalent in chat, e.g. a reaction vs full reply) is studied mostly in spoken/multimodal corpora with timing-based metrics (IoU, BLEU/ROUGE for content); its evaluation is essentially absent for text-based group-chat agents, where 'react vs reply vs ignore' is a real but ungraded decision.",
      "Production-style metrics (e.g. MUCA's excessive-chime-in % and participation-evenness) come from bespoke human studies on single systems and are not standardized or reproducible across agents, so there is no comparable cross-system leaderboard for restraint/timing quality in group chat."
    ],
    "findings": [
      {
        "name": "Large Language Models Know What To Say But Not When To Speak (TRP benchmark)",
        "kind": "paper",
        "year": "2024",
        "url": "https://arxiv.org/abs/2410.16044",
        "whatItIs": "An EMNLP 2024 Findings paper (Umair, Sarathy, de Ruiter) that introduces a dataset of participant-labeled within-turn Transition Relevance Places (TRPs) in unscripted spoken dialogue, used to test whether LLMs can predict WHEN it is appropriate to speak rather than what to say.",
        "evalMethod": "Frames when-to-speak as binary classification of each candidate point as a TRP or not. Grades model-predicted speaking opportunities against crowd/participant-labeled and expert-labeled TRPs using precision, recall, and F1 (reported F1 ~0.16 expert / ~0.14 participant condition), explicitly contrasting human-labeled timing against model predictions.",
        "relevance": "Directly operationalizes the 'when to speak' decision as a labeled prediction task and shows SOTA LLMs perform near-randomly at timing even though their content is fluent. The clearest demonstration that response-timing is a distinct, hard, separately-gradable capability from response-content.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2410.16044",
          "note": "Real EMNLP 2024 Findings paper (Umair, Sarathy, de Ruiter); URL correct. Eval description roughly accurate: paper decomposes turns into binary TRP classification tasks graded by precision/recall/F1 with expert vs participant conditions, and reports low F1 (~0.14-0.16 range; one model 0.151). Minor nuance: expert/participant are prompting conditions, while TRP labels come from participants (118 crowd) vs experts."
        }
      },
      {
        "name": "Beyond Words: Multimodal LLM Knows When to Speak",
        "kind": "paper",
        "year": "2025",
        "url": "https://arxiv.org/abs/2505.14654",
        "whatItIs": "Builds a dataset annotated for turn-taking labels, backchannel signals (e.g. 'mm-hmm'), and speech timing from Fisher, MAHNOB-HCI, and Harper Valley Bank corpora, training a multimodal LLM to decide whether the correct action at a given moment is to respond, backchannel, or stay silent.",
        "evalMethod": "Classification over response-opportunity vs non-response (silence) vs backchannel moments; silence/no-response is a first-class label the model must predict correctly, so the eval explicitly scores restraint (not speaking) as a correct outcome rather than only scoring generated text.",
        "relevance": "One of the few resources where 'silence is the correct action' is an explicit, graded label class. Demonstrates how to construct a benchmark that penalizes over-eager responding, which is the core failure mode of a group-chat agent that always replies.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2505.14654",
          "note": "Real paper (MM-When2Speak, Liao et al., May 2025). Eval framing is right: silence/reaction/full-response classification scored per-class P/R/F1, so restraint is a first-class correct label. BUT the corpora claim is fabricated \u2014 it uses its own 357 curated dyadic videos, NOT Fisher/MAHNOB-HCI/Harper Valley Bank."
        }
      },
      {
        "name": "An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2501.16643",
        "whatItIs": "A benchmark built on a multimodal triadic (3-party) dialogue corpus with addressee annotations (explicit addressees occur in ~20% of turns), testing whether LLMs can identify who is being addressed / who should take the next turn.",
        "evalMethod": "Accuracy on addressee/next-speaker recognition against human annotations. Reported result: GPT-4o scores only marginally above chance, quantifying how poorly current LLMs handle the floor-management precondition for deciding whether to respond.",
        "relevance": "Addressee recognition is the gating sub-decision for when-to-speak: an agent must know if it is being addressed before deciding to respond. Provides a concrete accuracy metric and a strong negative baseline (GPT-4o near chance) for the group-chat 'is this for me?' problem.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2501.16643",
          "note": "Confirmed real (Inoue, Lala, Elmers, Ochi, Kawahara). Triadic multimodal corpus, ~20% turns have explicit addressees, GPT-4o scores only marginally above chance \u2014 all claims accurate."
        }
      },
      {
        "name": "Addressee and Response Selection for Multi-Party Conversation (Ubuntu IRC benchmark)",
        "kind": "benchmark",
        "year": "2016",
        "url": "https://aclanthology.org/D16-1231.pdf",
        "whatItIs": "The foundational EMNLP 2016 paper (Ouchi & Tsuboi) that defines the joint Addressee-and-Response-Selection (ARS) task on the Ubuntu Multiparty Conversation Corpus: given multi-party context, select both the correct addressee and the correct next response from candidates.",
        "evalMethod": "Addressee selection accuracy (ADR-ACC) plus response selection as recall@k retrieval (Recall@1/@2 over candidate responses), and joint accuracy requiring both correct. Later SOTA (e.g. ASRG, Song 2022) reaches ~84.65% addressee accuracy on this set.",
        "relevance": "The canonical text-based multi-party benchmark and metric scheme (addressee accuracy + response recall@k) that nearly all later group-chat response work compares against. Establishes the 'who do I address' half of the floor-taking decision.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://aclanthology.org/D16-1231.pdf",
          "note": "Confirmed real: Ouchi & Tsuboi, EMNLP 2016; defines joint ADR-RES task on Ubuntu multiparty corpus with addressee accuracy + response recall@1/@2 + joint accuracy; Song 2022 ASRG ~84.65% SOTA also checks out."
        }
      },
      {
        "name": "MUCA: Multi-User Chat Assistant framework",
        "kind": "framework",
        "year": "2024",
        "url": "https://arxiv.org/html/2401.04883v1",
        "whatItIs": "An LLM framework for facilitating group text conversations whose Utterance Strategies Arbitrator explicitly decides the What/When/Who of a bot utterance, using an 'in-context chime-in' module with silence-duration and semantic-stagnation probabilities to decide whether to intervene.",
        "evalMethod": "Evaluated via human user studies (5-point efficiency/timing ratings; % of participants reporting the bot 'chimes in excessively' \u2014 56.25% for basic vs 0% for the advanced variant) plus quantitative engagement metrics (total words, average message length, an 'evenness' metric via stddev of participation, consensus rate). Includes a Multi-User Simulator (MUS) for rapid iteration on timing without humans.",
        "relevance": "A working group-chat agent whose central design problem IS when/whether to speak, and whose evaluation directly measures over-chiming as a defect. The 'excessive chime-in %' and participation-evenness metrics are reusable signals for grading a group agent's restraint and timing.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/html/2401.04883v1",
          "note": "Confirmed: arXiv 2401.04883 (MUCA, Microsoft Research, Jan 2024). 3W (What/When/Who) Utterance Strategies Arbitrator, in-context chime-in with silence + semantic-stagnation probabilities, human user studies (5-pt ratings, 56.25% vs 0% \"chimes in excessively\"), engagement metrics, and Multi-User Simulator (MUS) all match."
        }
      },
      {
        "name": "TurnGPT",
        "kind": "paper",
        "year": "2020",
        "url": "https://aclanthology.org/2020.findings-emnlp.268/",
        "whatItIs": "A GPT-2-based language model (Ekstedt & Skantze) that predicts turn-shifts by adding Transition Relevance Place (TRP) tokens to the vocabulary, projecting turn completion from text alone so a system can decide when to take the floor.",
        "evalMethod": "Evaluated on written and spoken dialogue corpora by predicting end-of-turn / TRP tokens; reports turn-prediction performance against turn boundaries and outperforms prior end-of-turn baselines. Foundation for text-only when-to-speak prediction (and the later RC-TurnGPT / PairwiseTurnGPT variants).",
        "relevance": "The reference text-based 'projecting when to take the floor' model. Its TRP-token formulation is the template later LLM timing benchmarks (2410.16044) adopt, and it shows how to measure floor-taking decisions from text without audio \u2014 relevant to text group chats.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://aclanthology.org/2020.findings-emnlp.268/",
          "note": "Confirmed real: TurnGPT (Ekstedt & Skantze, Findings of EMNLP 2020). URL correct. GPT-2 adapted with TRP tokens, evaluated on written+spoken dialog corpora predicting turn-shifts and outperforming prior baselines \u2014 description and eval method are accurate."
        }
      },
      {
        "name": "Triadic Multi-party Voice Activity Projection (VAP) for Turn-taking",
        "kind": "paper",
        "year": "2025",
        "url": "https://arxiv.org/abs/2507.07518",
        "whatItIs": "First extension of Voice Activity Projection to triadic (3-party) spoken conversation, predicting each speaker's future voice activity from acoustics to determine who takes the turn next in a multi-party setting.",
        "evalMethod": "Trains/evaluates on a Japanese triadic dataset; predicts future joint voice-activity patterns and compares triadic-trained VAP against dyadic/baseline models on turn-shift prediction (the type of conversation measurably affected accuracy). Uses the VAP family's future-voice-activity-window prediction as the grading signal.",
        "relevance": "Pushes turn-taking/floor-prediction from 2-party to genuine multi-party, the setting closest to group chat. Demonstrates that dyadic turn-taking models degrade in 3+ party settings \u2014 quantifying the multi-party gap that group-chat agents face.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2507.07518",
          "note": "Confirmed: arXiv 2507.07518 / Interspeech 2025, Elmers et al. (Kyoto U). First VAP extension to triadic dialogue; acoustic-only future voice-activity prediction; Japanese triadic dataset; triadic-trained VAP beat baselines, conversation type affected accuracy. Eval description accurate."
        }
      },
      {
        "name": "Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction",
        "kind": "paper",
        "year": "2024",
        "url": "https://arxiv.org/abs/2412.18061",
        "whatItIs": "Combines an LLM (Llama, semantic/syntactic context) with a Voice Activity Projection model (acoustic cues) via an LSTM ensemble to predict turn-taking opportunities, fusing what-was-said with how-it-was-said.",
        "evalMethod": "Predicts turn-end / speaking-opportunity points; the multi-party survey reports Lla-VAP as SOTA on the CCPE turn-detection benchmark at F1 = 83.13%, using F1 against labeled turn-shift points as the metric.",
        "relevance": "Shows the current best recipe for when-to-speak (LLM + acoustic fusion) and supplies a concrete benchmark+metric (CCPE, F1 83.13%) for the turn-detection sub-task that a group-chat agent's speak/stay-silent gate would be measured against.",
        "_verify": {
          "real": true,
          "accurate": false,
          "correctedUrl": "https://arxiv.org/abs/2412.18061",
          "note": "Real paper (Jeon, Guintu, Sahni 2024; Llama 3.2 + VAP via LSTM ensemble, correct). But eval-method is wrong: 83.13 is the VAP baseline's RECALL on CCPE, not the Lla-VAP ensemble's F1 \u2014 the ensemble's actual CCPE F1 is 0.964. The survey (arXiv 2505.18845) mis-transcribed it as \"Lla-VAP F1 83.13\"; no SOTA claim is made in the paper."
        }
      },
      {
        "name": "Multi-Party Conversational Agents: A Survey",
        "kind": "paper",
        "year": "2025",
        "url": "https://arxiv.org/abs/2505.18845",
        "whatItIs": "A survey organizing multi-party conversational-agent research, with explicit sub-sections on Turn Detection (when to speak) and Addressee Selection (whom to address), tabulating their benchmarks, metrics, and SOTA.",
        "evalMethod": "Itself a meta-resource: maps each capability to a named benchmark+metric \u2014 Turn Detection -> CCPE / F1 (SOTA 83.13%, Lla-VAP); Addressee Selection -> Ubuntu IRC / Accuracy (SOTA 84.65%, ASRG). Explicitly flags that modeling 'whether silence is correct' (response inhibition) is largely unexplored.",
        "relevance": "The single best entry point for this sub-topic: it consolidates the benchmark/metric landscape for when-to-speak and addressee selection, and its own stated gap ('response inhibition / silence-as-correct largely unexplored') frames the open problem for group-chat agent eval.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2505.18845",
          "note": "Verified real: \"Multi-Party Conversational Agents: A Survey\" (Sapkota et al., 2025). Table 1 confirms Turn Detection->Lla-VAP F1 83.13 and Addressee Selection->ASRG Ubuntu IRC Accuracy 84.65; CCPE cited. Response-inhibition gap is a fair characterization (paper omits modeling correct silence), though framed by the surveyor rather than stated verbatim."
        }
      }
    ]
  },
  {
    "subtopic": "Addressee recognition and speaker/mention resolution in multi-party conversation: evaluating whether agents talk to the right participant",
    "gaps": [
      "Almost all addressee/disentanglement benchmarks (Ubuntu IRC, Molweni) score CLASSIFICATION on human-authored logs \u2014 they grade whether a model can label who-talks-to-whom in a transcript, not whether a generative agent ACTUALLY addresses the right participant in its own emitted turn within a live group chat. End-to-end 'did the agent route its reply to the correct person' eval is thin.",
      "No widely adopted benchmark grades mis-addressing as a distinct failure mode with calibrated cost (e.g. answering participant B's question but @-mentioning participant A, or leaking one user's context to another). Privacy/cross-talk leakage in group chat (touched by MAGPIE) is largely separate from addressee-correctness eval.",
      "Modern @-mention / reply-thread resolution in real platforms (Slack, Discord, Teams) lacks a public LLM-agent benchmark; classic datasets predate threaded UIs with explicit @-mentions and reply pointers, so the easy signal (explicit mention) vs implicit-addressee disambiguation is under-studied for agents.",
      "LLM evidence (TEIDAN/GPT-4o near chance) shows the capability is weak, but there is no standardized agentic benchmark measuring addressee accuracy as part of a TASK-completing group-chat agent (e.g. a meeting/scheduling assistant in a 5-person thread) with downstream task-success tied to correct addressing.",
      "Metrics are fragmented and non-comparable: disentanglement uses VI/1-1/Shen-F, addressee selection uses accuracy, response selection uses recall@k. There is no unified joint metric for an agent that must simultaneously disentangle the thread, identify the addressee, AND produce a correct response in real time.",
      "Most corpora are English IRC or single small non-English multimodal sets; large-scale, multilingual, naturalistic group-chat data with gold addressee labels AND streaming/online evaluation (predict before seeing the future) remains scarce.",
      "Speaker-attribution work from the audio/diarization side (SpeakerLM, M3-SLU, speaker-attributed ASR) and the text-chat addressee-recognition work are siloed; there is little joint eval of an agent that must do diarization-grade speaker ID and text-grade addressee resolution together in a spoken multi-party setting."
    ],
    "findings": [
      {
        "name": "Addressee and Response Selection for Multi-Party Conversation (Ouchi & Tsuboi)",
        "kind": "dataset",
        "year": "2016",
        "url": "https://aclanthology.org/D16-1231/",
        "whatItIs": "Foundational EMNLP 2016 paper that formalized the joint task of selecting BOTH whom an agent addresses and what it says in a multi-party conversation. Released a large multi-party corpus built from Ubuntu IRC logs.",
        "evalMethod": "Two evaluation tracks. Addressee selection: pick the correct addressee from the set of prior speakers in context (accuracy, ACC-A). Response selection: pick the gold response from candidates (recall@k). Joint ADR-ACC requires both addressee and response correct simultaneously. Reported across context sizes and participant counts.",
        "relevance": "Directly defines the 'talking to the RIGHT participant' metric: an agent's output is only correct if it routes to the right addressee AND says the right thing. The joint ADR accuracy is the canonical way to grade addressing in group chat, and the IRC-derived corpus became the standard benchmark for follow-on work.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://aclanthology.org/D16-1231/",
          "note": "Real EMNLP 2016 paper (Ouchi & Tsuboi, pp. 2133-2143); eval desc accurate \u2014 joint addressee (ADR) accuracy + response recall@k + joint pair selection (paper labels it ADR-RES, not ADR-ACC)."
        }
      },
      {
        "name": "Who Is Speaking to Whom? W2W model (Le, Hu et al.)",
        "kind": "paper",
        "year": "2019",
        "url": "https://aclanthology.org/D19-1199/",
        "whatItIs": "EMNLP-IJCNLP 2019 paper introducing the who-to-whom (W2W) model that identifies the addressee of EVERY utterance in a session jointly, not just the next response. Uses interacting GRUs (speaker-state, listener-state, utterance-fusion) scanning the session bidirectionally.",
        "evalMethod": "Evaluated on the Ouchi-Tsuboi Ubuntu IRC multi-party corpus. Metric is addressee identification accuracy per utterance, with breakdowns by number of session participants; compared against the static/dynamic baselines from Ouchi & Tsuboi.",
        "relevance": "Extends addressee resolution from 'who should the agent address next' to recovering the full who-talks-to-whom graph of a conversation. That graph is exactly what a group-chat agent needs to track to know which thread/participant a message belongs to before responding.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://aclanthology.org/D19-1199/",
          "note": "Real EMNLP-IJCNLP 2019 paper (Le, Hu et al., pp.1909-1919); URL correct. Verified from PDF: W2W jointly identifies all utterances' addressees via interacting SGRU/LGRU/UGRU with forward-backward session scanning, evaluated on Ouchi-Tsuboi Ubuntu corpus with addressee accuracy vs DRNN/SIRNN baselines. Description accurate."
        }
      },
      {
        "name": "A Large-Scale Corpus for Conversation Disentanglement (Kummerfeld et al.) / irc-disentanglement",
        "kind": "dataset",
        "year": "2019",
        "url": "https://github.com/jkkummerfeld/irc-disentanglement",
        "whatItIs": "ACL 2019 release of 77,563 #Ubuntu/#Linux IRC messages manually annotated with reply-to (parent-child) links forming reply-structure graphs that both disentangle interleaved conversations and define internal structure. 16x larger than all prior disentanglement datasets combined; includes adjudicated disagreements and context. Also on TensorFlow Datasets as irc_disentanglement.",
        "evalMethod": "Graph-edge prediction (which earlier message each message replies to), then clustering into conversations. Disentanglement clustering scored with VI (variation of information), one-to-one overlap, Shen-F, and exact-match conversation extraction. Reply-link prediction scored with precision/recall/F1 on edges. Served as DSTC-8 Track 2 Task 4.",
        "relevance": "The canonical benchmark for the prerequisite skill of group-chat understanding: separating multiple overlapping conversations in one shared stream. An agent that responds to the wrong participant usually failed disentanglement first \u2014 it answered a message belonging to a different ongoing thread.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://github.com/jkkummerfeld/irc-disentanglement",
          "note": "Confirmed: ACL 2019 (P19-1374), 77,563 Ubuntu/Linux IRC msgs w/ reply-structure graphs, 16x larger, adjudicated+context; on TFDS as irc_disentanglement; eval (link P/R/F1 + cluster VI/one-to-one/Shen-F/exact-match) and DSTC8 Track 2 use all accurate."
        }
      },
      {
        "name": "Molweni",
        "kind": "dataset",
        "year": "2020",
        "url": "https://aclanthology.org/2020.coling-main.238/",
        "whatItIs": "COLING 2020 machine-reading-comprehension dataset over multiparty dialogue, sampled from the Ubuntu Chat Corpus: 10,000 dialogs / 88,303 utterances, 30,066 questions (incl. unanswerable), with full SDRT-style discourse dependency annotations (78,245 relations, 16 relation types).",
        "evalMethod": "Two tasks. (1) MRC: answer 5W1H questions (notably many 'Who'-type) over a multi-party dialog, scored with SQuAD-style EM/F1; BERT-wwm drops to ~67.7% F1, a 20+ point fall vs SQuAD 2.0. (2) Discourse parsing: predict reply-to links and relation labels between utterances, scored on link/relation F1.",
        "relevance": "Tests whether a model can answer 'who said/did what to whom' over interleaved multi-party chat. The 'Who' questions and the discourse-link annotations directly probe speaker/addressee attribution, and the steep performance drop shows multi-party attribution is genuinely hard.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://aclanthology.org/2020.coling-main.238/",
          "note": "Confirmed real (Li et al., COLING 2020). Stats match: 10,000 dialogs, 88,303 utterances, 30,066 questions, 78,245 SDRT relations. Eval method accurate: MRC EM/F1 (BERT-wwm 67.7% F1, ~20pt drop vs SQuAD 2.0) + discourse parsing link/relation F1."
        }
      },
      {
        "name": "An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue (Inoue et al., TEIDAN)",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2501.16643",
        "whatItIs": "Kyoto University benchmark testing whether modern LLMs (GPT-4o) can do addressee recognition and next-speaker prediction in spontaneous triadic (3-person) dialogue, using the new TEIDAN corpus (30 Japanese sessions, ~29h; ~20% of turns have an explicit addressee).",
        "evalMethod": "4-way classification per turn: addressee is participant A, B, C, or 'O' (no specific addressee); metric is accuracy against an 80.1% majority-class chance baseline. Also next-speaker prediction accuracy. Tested with and without gaze/multimodal features.",
        "relevance": "The modern LLM-era counterpart to Ouchi & Tsuboi: shows frontier LLMs barely beat chance on addressee recognition (GPT-4o 80.9% vs 80.1% chance) and fall BELOW chance on next-speaker prediction \u2014 direct evidence that talking to the right participant is an unsolved evaluation target for LLM agents.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2501.16643",
          "note": "Real (Inoue et al., Kyoto Univ., IWSDS 2025). Eval method accurate: 4-way A/B/C/O addressee classification vs 80.1% majority-class baseline, plus next-speaker prediction, with/without gaze via OpenFace, GPT-4o. Caveat: ~29h is wrong \u2014 the annotated subset is 29 min 20 sec (~half hour), not 29 hours; 30 sessions total is correct."
        }
      },
      {
        "name": "Multimodal Conversation Structure Understanding (MCSU)",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/html/2505.17536v3",
        "whatItIs": "A 2025 benchmark for evaluating (multimodal) LLMs on the structural fabric of multi-party conversation \u2014 including speaker/addressee and reply-to relations \u2014 beyond surface content understanding.",
        "evalMethod": "Structured question/annotation tasks over multi-party dialogue covering speaker attribution, addressee identification, and reply/conversation-structure relations; models graded on per-relation accuracy/F1 against human annotations.",
        "relevance": "One of the few recent benchmarks bundling speaker AND addressee AND reply-structure into a single LLM-facing evaluation, which is precisely the cluster of skills a group-chat agent must demonstrate to route responses correctly.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2505.17536",
          "note": "Real: \"Multimodal Conversation Structure Understanding\" (UC Berkeley, May 2025) introduces the TV-MMPC benchmark; eval-method description (speaker/addressee/reply-to relations graded by accuracy/Set-F1 vs human annotations) is accurate. URL given was the HTML v3 render; canonical abs page provided."
        }
      },
      {
        "name": "Multi-Party Conversational Agents: A Survey",
        "kind": "paper",
        "year": "2025",
        "url": "https://arxiv.org/pdf/2505.18845",
        "whatItIs": "A 2025 survey of multi-party conversational agents that organizes the field around the sub-capabilities required for group settings, including who-speaks-next / turn-taking, addressee recognition, and conversation disentanglement, plus the datasets and metrics used for each.",
        "evalMethod": "Survey (no new eval); catalogs the standard tasks and their metrics \u2014 addressee/response selection accuracy, disentanglement clustering metrics (VI, 1-1, Shen-F), and next-speaker accuracy \u2014 and maps which datasets (Ubuntu IRC, Molweni, etc.) support which.",
        "relevance": "Provides the taxonomy and metric inventory for evaluating group-chat agents on addressing the right participant, and is a current entry point that ties the classic NLP datasets to modern LLM-agent settings.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/pdf/2505.18845",
          "note": "Real: arXiv:2505.18845 (Sapkota et al., 24 May 2025). Description roughly accurate \u2014 survey covers turn detection/next-speaker, addressee selection, disentanglement, response selection, with Ubuntu IRC/Molweni datasets and accuracy/F1/BLEU metrics; but it organizes around three themes (State-of-Mind, Semantic Understanding, Agent Action Modeling), and exact clustering metrics VI/1-1/Shen-F were not explicitly verified."
        }
      },
      {
        "name": "WHO Says WHAT to WHOM: A Survey of Multi-Party Conversations",
        "kind": "paper",
        "year": "2022",
        "url": "https://www.ijcai.org/proceedings/2022/0768.pdf",
        "whatItIs": "IJCAI 2022 survey framing multi-party conversation research around the three coupled questions of WHO (speaker), WHAT (utterance), and to WHOM (addressee), surveying tasks, models, and corpora.",
        "evalMethod": "Survey; consolidates the addressee-recognition and response-selection task formulations and their accuracy-based metrics, and the role of the Ubuntu IRC corpus as the shared benchmark.",
        "relevance": "The clearest conceptual statement of why group-chat evaluation must jointly grade speaker, content, and addressee \u2014 the 'to whom' axis is exactly the right-participant routing skill in scope here.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://www.ijcai.org/proceedings/2022/0768.pdf",
          "note": "Confirmed: IJCAI 2022 survey by Gu, Tao & Ling, framed around WHO/WHAT/WHOM. URL valid; survey covers addressee recognition + response selection tasks with Ubuntu IRC as the shared benchmark."
        }
      }
    ]
  },
  {
    "subtopic": "Evaluating social appropriateness, multi-speaker context tracking, and persona/role behavior of agents in group conversations",
    "gaps": [
      "Almost all benchmarks evaluate 2-party multi-turn (e.g., MultiChallenge) or dyadic role-play; genuine N>2 group-chat evaluation with overlapping threads, interruptions, and side conversations is thin \u2014 addressee recognition (2501.16643) is one of the few, and even it caps at triadic.",
      "Addressee / turn-taking and who-is-talking-to-whom tracking is shown to be near chance for frontier models, but there is no standardized, scalable benchmark for floor-management (when to speak, when to stay silent, interjection appropriateness) in a live group.",
      "Social-appropriateness rubrics (SOTOPIA's Social Rules, RENOVI norms) are mostly applied to whole dialogues or single agents; per-turn, speaker-conditioned appropriateness scoring (the same line appropriate from one role but not another) in an active group is underdeveloped despite NormBank's role-conditioning.",
      "Persona/role consistency benchmarks (PersonaGym, RPEval, CharacterBench) are predominantly single-agent question-answer or dyadic; persona drift specifically under multi-party pressure (conformity, peer influence) is only just being probed (Social Laboratory, DEBATE) and lacks standardized metrics.",
      "LLM-as-judge is the dominant grader for social/persona dimensions, but judge reliability for multi-party context (correctly attributing which agent did what across many speakers, handling long group transcripts) is largely unvalidated; SOTOPIA validates GPT-4-vs-human only in dyadic settings.",
      "Most group-dynamics findings (premature convergence, excessive partner influence, public/private belief dissociation in DEBATE) are diagnostic of unrealism but there is no agreed target metric or pass/fail threshold for 'human-like group behavior,' making cross-paper comparison hard.",
      "Cultural and multilingual coverage of group social-norm appropriateness is sparse (RENOVI/VideoNorms touch culture, DiscoTrack adds languages) \u2014 appropriateness norms vary by culture and there is no group-chat benchmark that systematically varies cultural setting.",
      "Memory/state tracking across very long group threads (who said what, evolving relationships, secrets among a subset of participants) is gestured at by SOTOPIA's Secret/Knowledge dimensions and Lifelong-SOTOPIA but lacks a dedicated multi-speaker long-context tracking benchmark."
    ],
    "findings": [
      {
        "name": "SOTOPIA / SOTOPIA-Eval",
        "kind": "benchmark",
        "year": "2023",
        "url": "https://arxiv.org/abs/2310.11667",
        "whatItIs": "An open-ended environment that simulates goal-driven social interactions between LLM agents who role-play diverse character profiles with private goals and relationship constraints, then scores their social intelligence. Published at ICLR 2024.",
        "evalMethod": "Seven-dimension holistic rubric (SOTOPIA-Eval): Believability [0-10], Relationship [-5,5], Knowledge [0-10], Secret [-10,0], Social Rules [-10,0] (norm/legal adherence), Financial benefit [-5,5], and Goal Completion [0-10]. Scored by both human annotators and GPT-4 acting as judge on the same prompts; GPT-4 scores are validated against human scores. Overall = average of the seven dimensions.",
        "relevance": "The canonical multi-turn, multi-party role-play social benchmark. Its Social Rules and Believability dimensions are direct, reusable rubric axes for grading social appropriateness and persona adherence in a group/multi-speaker setting; the GPT-4-as-judge-vs-human validation is a template for LLM-judge calibration.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2310.11667",
          "note": "Confirmed: SOTOPIA (Zhou et al., ICLR 2024) at arXiv 2310.11667; all seven SOTOPIA-Eval dimensions and ranges match, and GPT-4-judge-validated-against-humans is accurate."
        }
      },
      {
        "name": "DEBATE: Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2510.25110",
        "whatItIs": "A large-scale benchmark (30,707 messages, 2,832 U.S. participants, 708 groups, 107 topics) that measures whether multi-agent role-playing LLMs reproduce authentic human group dynamics in long-form group discussions, with both public messages and private Likert-scale beliefs.",
        "evalMethod": "Utterance-level metrics: semantic similarity (embedding cosine), stance difference (Likert delta -2.5 to +2.5), ROUGE-L, length deltas, on-topic rate. Group-level opinion-dynamics metrics: opinion convergence (std-dev reduction across rounds), opinion shift, public-vs-private dissociation. Individual-level: regression-to-mean and partner-influence. Two settings: next-message prediction and full-conversation rollout.",
        "relevance": "The most directly group-chat-relevant resource: it evaluates emergent multi-party dynamics (premature convergence, partner influence, public/private belief drift) rather than single-turn quality. Provides concrete metrics for whether a group of agents behaves like a real group, a gap most single-agent evals miss.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2510.25110",
          "note": "Real paper (NeurIPS 2025); eval-method description (utterance/group/individual opinion-dynamics metrics, two settings) is accurate. Minor stat drift: cleaned set is ~29,417 msgs / 2,788 participants / 697 groups vs claimed raw 30,707/2,832/708; 107 topics correct. Title also appears as \"...Evaluating Opinion Dynamics in Role-Playing LLM Agents\" on the arXiv abstract page."
        }
      },
      {
        "name": "Addressee Recognition in Multi-modal Multi-party Dialogue (LLM benchmark)",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2501.16643",
        "whatItIs": "A benchmark built on a multi-modal corpus of triadic (3-participant) discussions that tests whether an LLM can identify the addressee \u2014 who is being spoken to / who should take the next turn \u2014 a core multi-party-only skill. Authors: Inoue, Lala, Elmers, Ochi, Kawahara.",
        "evalMethod": "Classification accuracy on addressee prediction over annotated turns (explicit addressees appear in ~20% of turns). GPT-4o is benchmarked against chance baseline.",
        "relevance": "Isolates the single most distinctive multi-speaker-context-tracking subtask: knowing who is talking to whom. The result (GPT-4o only marginally above chance) is a sharp empirical signal that current LLMs fail at the addressee/turn-taking tracking that group-chat agents depend on.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2501.16643",
          "note": "Confirmed real (arXiv:2501.16643, also IWSDS 2025); authors Inoue, Lala, Elmers, Ochi, Kawahara and eval method (accuracy on addressee prediction, ~20% explicit-addressee turns, GPT-4o marginally above chance) all match."
        }
      },
      {
        "name": "MPCEval: A Benchmark for Multi-Party Conversation Generation",
        "kind": "benchmark",
        "year": "2026",
        "url": "https://arxiv.org/abs/2603.04969",
        "whatItIs": "A standardized, task-aware framework for evaluating multi-party conversation generation, covering both next-message prediction and full-conversation generation across varied participant configurations. Authors led by Minxing Zhang.",
        "evalMethod": "Mix of automatic reference metrics (perplexity, ROUGE, BLEU, BERTScore, BARTScore), neural/semantic metrics (G-Eval), and conversation-specific behavioral dimensions such as speaker coherence and dialogue naturalness across multiple speakers.",
        "relevance": "A purpose-built MPC benchmark that separates next-message prediction from full-rollout generation and explicitly scores speaker-coherence \u2014 directly applicable to grading whether a group-chat agent produces contextually appropriate, correctly-attributed turns.",
        "_verify": {
          "real": true,
          "accurate": false,
          "correctedUrl": "https://arxiv.org/abs/2603.04969",
          "note": "Paper is real (Minxing Zhang et al., arXiv:2603.04969, Mar 2026), but eval-method is wrong: MPCEval explicitly rejects ROUGE/BLEU/BERTScore/BARTScore/perplexity/G-Eval in favor of novel reference-free metrics across speaker modeling, content quality, and speaker-content consistency."
        }
      },
      {
        "name": "PersonaGym / PersonaScore",
        "kind": "benchmark",
        "year": "2024",
        "url": "https://arxiv.org/abs/2407.18416",
        "whatItIs": "The first dynamic evaluation framework for persona agents (200 personas, 10k questions, 150 environments), with PersonaScore as an automated, human-aligned metric grounded in decision theory. Published in EMNLP 2025 Findings.",
        "evalMethod": "Five task axes: Expected Action, Action Justification, Linguistic Habits, Persona Consistency, and Toxicity Control, each scored 1-5 by an ensemble of strong LLM judges; environments and questions generated per-persona.",
        "relevance": "Provides reusable, decision-theory-grounded metrics for persona/role behavior \u2014 the persona-consistency and toxicity-control axes map onto grading whether each agent stays in character and socially appropriate across a long group conversation. Notably finds frontier models barely beat weaker ones (Claude 3.5 Sonnet only ~3% over GPT-3.5).",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2407.18416",
          "note": "Confirmed real and accurate: PersonaGym/PersonaScore, 200 personas, 10k questions, 150 environments, EMNLP 2025 Findings, decision-theory metric, and all 5 task axes (Expected Action, Action Justification, Linguistic Habits, Persona Consistency, Toxicity Control) all match the paper."
        }
      },
      {
        "name": "RENOVI: Remediating Norm Violations in Socio-Cultural Conversations",
        "kind": "benchmark",
        "year": "2024",
        "url": "https://arxiv.org/abs/2402.11178",
        "whatItIs": "A large-scale corpus of 9,258 multi-turn dialogues (512 human-authored + 8,746 ChatGPT-synthesized) annotated with social norms, designed to evaluate detecting and remediating norm violations step by step. Led by Haolan Zhan.",
        "evalMethod": "A sequenced task pipeline that progresses from norm-violation detection/classification through generating remediation responses, plus measurement of alignment between LLMs and human social-norm awareness; synthetic data is shown to improve performance.",
        "relevance": "Targets social-norm appropriateness explicitly, including the remediation step (how an agent should repair a norm violation) \u2014 relevant for evaluating whether a group-chat agent both notices and gracefully recovers from socially inappropriate moments.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2402.11178",
          "note": "Confirmed real (NAACL 2024 Findings, lead author Haolan Zhan); 9,258 dialogues (512 human + 8,746 ChatGPT-synthesized), sequential detect-then-remediate task pipeline with LLM-human norm alignment analysis \u2014 all accurate."
        }
      },
      {
        "name": "NormBank (SCENE taxonomy)",
        "kind": "dataset",
        "year": "2023",
        "url": "https://aclanthology.org/2023.acl-long.429.pdf",
        "whatItIs": "A knowledge bank of 155k situational social norms (Ziems et al., ACL 2023) where each norm is grounded in a multivalent sociocultural frame \u2014 setting, agent roles, attributes, and physical/social/cultural constraints \u2014 via the SCENE taxonomy.",
        "evalMethod": "Supports non-monotonic normative reasoning by encoding contrast sets where the same behavior is labeled expected / permitted / unexpected depending on role and setting; used as an eval/training resource for moral classification and social-commonsense QA.",
        "relevance": "Provides role- and setting-conditioned norm labels \u2014 crucial for evaluating social appropriateness in group chat where the same utterance is appropriate from one role/speaker but not another. Its role+setting grounding is a model for context-dependent appropriateness scoring.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://aclanthology.org/2023.acl-long.429.pdf",
          "note": "Real: NormBank (Ziems et al., ACL 2023, paper 2023.acl-long.429); 155k norms, 63k SCENE-taxonomy constraints, non-monotonic. Labels are expected/permitted/unexpected as a classification task \u2014 description roughly accurate; \"social-commonsense QA\" framing is loose but fair. URL valid."
        }
      },
      {
        "name": "The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation",
        "kind": "framework",
        "year": "2025",
        "url": "https://www.arxiv.org/abs/2510.01295",
        "whatItIs": "A psychometric evaluation framework (Zarreen Reza) that assesses LLMs as social actors inside multi-agent debates rather than in isolation, using a 3-round multi-party debate structure (Change My View data).",
        "evalMethod": "Measures conformity dynamics (position shift under group pressure), persuasion effectiveness (ability to move others), role adherence (persona consistency across rounds), and broader psychometric properties applied to agent behavior.",
        "relevance": "Frames group-chat evaluation as measuring emergent interactional properties (conformity, persuasion, role stability) across multiple agents \u2014 directly addresses social appropriateness and persona/role behavior at the group rather than individual level.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2510.01295",
          "note": "Real paper by Zarreen Reza (NeurIPS 2025 workshop); eval-method description roughly accurate \u2014 multi-agent debate \"social laboratory\" with personas, moderator, and psychometric metrics for conformity/consensus, persuasion, and persona role adherence. Could not independently verify the exact \"3-round / Change My View\" specifics, but core method matches. Normalized URL host www.arxiv.org -> arxiv.org."
        }
      }
    ]
  },
  {
    "subtopic": "Frameworks, simulators, and user-simulation harnesses for evaluating agents in multi-party / group-chat environments",
    "gaps": [
      "Simulated-participant fidelity is unsolved and is the dominant threat to validity: RealUserSim and the Sim2Real-Gap paper show LLM simulators are over-cooperative ('easy mode') and inflate agent success, yet almost no group-chat eval reports a simulator-fidelity score (USI / PT3-style) alongside agent grades.",
      "Most 'multi-party' frameworks are actually agent-vs-agent (SOTOPIA, MultiAgentBench) or still dyadic underneath (tau2-bench is one user + one agent). True N-human-simulated-participants-plus-one-agent-under-test harnesses with grading are rare (ProMediate, MUCA, GroupMemBench are the main exceptions).",
      "Turn-taking, interruption, and addressee management (who to respond to, when to stay silent) are barely measured \u2014 the addressee-recognition benchmark shows even GPT-4o is near chance, but no integrated harness grades an agent's live floor-management decisions during a simulated group chat.",
      "No standard reward model for group-chat success exists. Binary task reward (tau-bench style) is shown to be orthogonal to human-perceived quality, but the alternatives (consensus change, conversation evenness, mediator effectiveness) are bespoke per-paper and not cross-comparable.",
      "LLM-as-judge bias is documented (SOTOPIA over-scores Social Rules/Secret) yet there is no calibrated, group-chat-specific judge or inter-rater-agreement protocol for multi-party transcripts where the judge must attribute behavior to the right speaker across many turns.",
      "Scalability and cost of running many simulated participants over long multi-party sessions is unaddressed; no framework reports how grading reliability degrades as participant count and conversation length grow.",
      "Speaker-grounded memory / per-user belief tracking and audience-adapted language are only just being benchmarked (GroupMemBench), with best systems under 50% accuracy \u2014 there is no eval that jointly grades memory, social behavior, AND task completion in one multi-party run.",
      "Reproducibility: simulated-participant configs (persona seeds, conflict modes, decoding params) strongly affect agent scores, but few harnesses pin or release them, making cross-paper comparison of group-chat agent results unreliable."
    ],
    "findings": [
      {
        "name": "ProMediate",
        "kind": "framework",
        "year": "2025",
        "url": "https://arxiv.org/abs/2510.25224",
        "whatItIs": "A socio-cognitive framework (USC + Microsoft) for evaluating proactive AI mediator agents in multi-topic, multi-party negotiations. Includes a simulation testbed with theory-driven difficulty tiers (Easy/Medium/Hard) and a plug-and-play mediator that decides when/how to intervene.",
        "evalMethod": "Five metrics: Consensus Change (windowed agreement delta over last-10 minus first-10 turns), Topic-Level Efficiency, Response Latency, Mediator Effectiveness (consensus-trend slope before/after intervention), and Mediator Intelligence (LLM-as-judge 1-5 across four socio-cognitive dimensions). Participant attitudes extracted per turn by GPT-4.1; pairwise agreement scored across 5 dimensions then group-averaged. Simulated participants generated by Claude-Sonnet-4 under three conflict modes (Accommodating/Avoiding/Competing) with explicit preference rankings; human validation of naturalness (4.18/5) and mode-consistency (3.61/5).",
        "relevance": "The most directly on-point resource: it simulates multiple human participants in a group negotiation and grades an agent's intervention quality, not just final task success. The consensus-change and intervention-latency metrics are exactly the kind of group-chat-specific eval machinery the sub-topic asks for.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2510.25224",
          "note": "Confirmed real (Liu, Sarrafzadeh, Zhou, Yang, Zhao, Sharma \u2014 USC + Microsoft); all five metrics, GPT-4.1 attitude extraction, Claude-Sonnet-4 participants with three conflict modes, and human-validation scores (4.18/3.61) match the paper."
        }
      },
      {
        "name": "SOTOPIA / SOTOPIA-EVAL",
        "kind": "benchmark",
        "year": "2023",
        "url": "https://arxiv.org/abs/2310.11667",
        "whatItIs": "An open-ended, procedurally generated social-interaction environment (ICLR 2024) where LLM agents role-play characters with private goals and relationship constraints. Scenarios span dyadic negotiation through multi-party planning (group event scheduling, multi-agent games), cooperative/competitive/mixed-motive.",
        "evalMethod": "SOTOPIA-EVAL scores each episode on seven dimensions on 11-point Likert scales: Goal Completion (0-10), Believability (0-10), Knowledge (0-10), Secret (-10-0), Relationship (-5-5), Social Rules (-10-0), Financial/Material (-5-5). Scored by human raters and/or GPT-4 as LLM judge. Modeled as a Dec-POMDP where agents condition on global+local history. Paper notes LLM judges overestimate on Social Rules and Secret dimensions.",
        "relevance": "Canonical multi-party social simulation eval: agents (or humans) are simulated participants pursuing hidden goals, and the seven-dimension rubric is a reusable template for grading social/relational behavior in group chat rather than narrow task pass/fail. Its documented LLM-judge bias is a useful caution.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2310.11667",
          "note": "Real (SOTOPIA, ICLR 2024 spotlight). 7 dims + ranges + 11-pt Likert + human/GPT-4 judges all verified correct; but \"Dec-POMDP\" framing is fabricated and the \"judges overestimate on Social Rules/Secret\" line is an inexact paraphrase (paper says GPT-4 is weaker on SOC/SEC)."
        }
      },
      {
        "name": "tau2-bench (\u03c4\u00b2-bench)",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://github.com/sierra-research/tau2-bench",
        "whatItIs": "Sierra's benchmark for tool-agent-user interaction. \u03c4\u00b2-bench extends \u03c4-bench to a dual-control setting (Telecom domain) where BOTH the simulated user and the agent can call tools, modeling collaborative troubleshooting.",
        "evalMethod": "A UserSimulator LLM (wrapped via Tau2UserSimulatorAdapter, generate_next_message(message, state)) plays the human across turns, maintaining conversation state. Scoring gates reward on evaluation_criteria.actions (required tool actions / database end-state) plus policy compliance; supports num-trials repeated runs and pass^k reliability (fraction of tasks an agent solves on all k attempts).",
        "relevance": "The reference user-simulation harness for conversational-agent eval and the natural base to extend toward multi-party. Its pass^k reliability metric and 'simulated user drives the conversation' design are the standard machinery practitioners reach for; dual-control is a step toward multiple acting participants.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://github.com/sierra-research/tau2-bench",
          "note": "Confirmed: Sierra's tau2-bench (arXiv 2506.07982) extends tau-bench to a dual-control Telecom domain where both the simulated user and agent call tools; scoring gates on evaluation_criteria actions/DB end-state plus policy, with pass^k reliability over num-trials. Eval-method description is accurate."
        }
      },
      {
        "name": "RealUserSim",
        "kind": "framework",
        "year": "2026",
        "url": "https://arxiv.org/abs/2605.20204",
        "whatItIs": "A user-simulation framework grounded in real behavioral data: extracts 7,275 executable behavioral profiles from 14,000+ authentic human-LLM conversations (WildChat) and uses them to ground LLM simulators, instead of generic or hand-written personas.",
        "evalMethod": "Introduces a fidelity benchmark (PT3) over 600 conversations / 71+ domains with anti-leakage controls, scoring style/behavior match rate across five behavioral dimensions (grounding raised match from 24.2% to 45.3%). Runs agent eval on TauBench with 6 simulator models, surfacing failure mechanisms invisible to cooperative simulators (-3.2% to -3.5% task-success degradation). Names two failure modes: Formalism Ceiling and Directive Amplification.",
        "relevance": "Directly attacks the central validity problem for simulated-participant eval: are your simulated humans realistic enough to trust the grade? Essential reading for anyone building a group-chat eval harness, since grounding (not just prompting) is shown to materially change agent scores.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2605.20204",
          "note": "Confirmed real (arXiv 2605.20204, \"RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation\"); eval-method description matches \u2014 WildChat profiles, PT3 fidelity benchmark (24.2%->45.3%, five dimensions), TauBench w/ 6 simulators, and both named failure modes verified."
        }
      },
      {
        "name": "Mind the Sim2Real Gap in User Simulation for Agentic Tasks",
        "kind": "paper",
        "year": "2026",
        "url": "https://arxiv.org/abs/2603.11245",
        "whatItIs": "A CMU LTI study quantifying how faithfully LLM user simulators replicate real human behavior in agent interactions, and how that mismatch distorts benchmark scores.",
        "evalMethod": "Introduces the User-Sim Index (USI), a 0-100 composite over six dimensions (communication style, information patterns, clarification behavior, error reactions, outcome calibration, evaluative alignment). Uses Sorensen-Dice for behavioral-feature overlap, Expected Calibration Error for outcome agreement, MAE for quality scoring, validated against 451 human annotators on 165 \u03c4-bench tasks. Finds LLM simulators create an 'easy mode' inflating agent success above the 63.6% human baseline, and that binary reward is largely orthogonal to human-perceived quality.",
        "relevance": "Provides the metric (USI) and human-grounded methodology to validate the simulated participants in any group-chat eval before trusting agent grades. Its finding that binary task reward diverges from human quality judgments is a core warning for multi-party eval design.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2603.11245",
          "note": "Confirmed: CMU LTI paper (Zhou, Neubig, Sap et al.). USI is a 0-100 composite over six dimensions using Sorensen-Dice, ECE, and MAE; validated on 451 humans / 165 tau-bench tasks; 63.6% human baseline and binary-reward orthogonality all match."
        }
      },
      {
        "name": "SAGE",
        "kind": "framework",
        "year": "2025",
        "url": "https://arxiv.org/abs/2510.11997",
        "whatItIs": "A top-down/bottom-up knowledge-grounded user simulator for multi-turn agent evaluation (Columbia DAPLab, Findings of EACL 2026). Grounds simulated users in business logic (ideal customer profiles) and agent infrastructure (product catalogs, FAQs, knowledge bases).",
        "evalMethod": "Generates realistic, domain-grounded multi-turn dialogues against the agent under test; effectiveness measured as bug-finding power \u2014 the grounded simulator surfaces up to 33% more agent errors than generic-user baselines, and produces more realistic/diverse interactions for iterative agent improvement.",
        "relevance": "Shows the eval payoff is error discovery, not a leaderboard number: a better simulated participant exposes more agent failures. The top-down (persona) + bottom-up (knowledge) grounding recipe transfers to constructing believable participants in a group-chat harness.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2510.11997",
          "note": "Confirmed real: SAGE (Shea, Lu, Qiu, Yu; Columbia DAPLab; Findings of EACL 2026). Top-down/bottom-up knowledge-grounded user simulator; eval-method and \"up to 33% more agent errors\" claim both accurate."
        }
      },
      {
        "name": "GroupMemBench",
        "kind": "benchmark",
        "year": "2026",
        "url": "https://arxiv.org/abs/2605.14498",
        "whatItIs": "A benchmark for LLM agent memory in genuinely multi-party conversations (groups/channels with multiple users talking to the agent and to each other), built because existing memory benchmarks assume dyadic single-user setups.",
        "evalMethod": "A graph-grounded synthesis pipeline builds multi-party discussions with controllable reply structure, conditioning each message on per-user personas and target audiences. An adversarial query pipeline binds questions to specific askers across six categories (multi-hop, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, abstention) and grades accuracy on three properties: group dynamics, speaker-grounded belief tracking (per-user memory), and audience-adapted language (Theory-of-Mind). Strongest system scored only 46.0% (knowledge-update 27.1%, term-ambiguity 37.7%).",
        "relevance": "One of the few benchmarks purpose-built for group-chat agents rather than dyads. Its speaker-grounded belief tracking and audience-adapted-language axes are precisely the multi-party capabilities a group-chat eval must isolate, and its synthetic-participant pipeline is a concrete construction method.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2605.14498",
          "note": "Confirmed real: arXiv 2605.14498 \"GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations\" matches the title, multi-party premise, graph-grounded synthesis pipeline, three properties (group dynamics, speaker-grounded belief tracking, audience-adapted language/ToM), and exact scores (46.0% best, knowledge-update 27.1%, term-ambiguity 37.7%)."
        }
      },
      {
        "name": "MUCA + MUS (Multi-User Chat Assistant / Multi-User Simulator)",
        "kind": "framework",
        "year": "2024",
        "url": "https://arxiv.org/abs/2401.04883",
        "whatItIs": "MUCA is described as the first LLM framework dedicated to multi-user group conversations, structured around the 3W problem (What/When/Who to answer). MUS is its paired LLM-based Multi-User Simulator used to evaluate MUCA.",
        "evalMethod": "MUS models user behavior from real chat records (speaking roles, utterance traits/length) then generates natural utterances, simulating multiple participants in a group; it improves via human-in-the-loop feedback. MUCA is graded against a baseline chatbot on metrics: user engagement, conversation evenness, opinion consensus, efficiency, conciseness, and usefulness, across decision-making, problem-solving, and open-discussion tasks.",
        "relevance": "An early, explicit simulate-multiple-humans-to-test-a-group-agent harness. The 'When' and 'Who to answer' dimensions and the conversation-evenness / opinion-consensus metrics are group-chat-specific eval constructs absent from dyadic benchmarks.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2401.04883",
          "note": "Confirmed: arXiv 2401.04883 (Mao et al., 2024) describes MUCA + MUS, the 3W problem, MUS behavior-modeling from real chat records with human-in-the-loop, and the listed metrics vs a GPT-4 baseline across decision-making/problem-solving/open-discussion tasks. All accurate."
        }
      },
      {
        "name": "An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2501.16643",
        "whatItIs": "A benchmark isolating addressee recognition \u2014 identifying who is being spoken to / who should take the next turn \u2014 a task unique to multi-party (3+ participant) dialogue.",
        "evalMethod": "Grades models on accuracy of predicting the addressee for the next turn over multi-party (and multi-modal) dialogue transcripts; GPT-4o scored only marginally above chance, quantifying the difficulty.",
        "relevance": "Targets the single hardest primitive in group-chat agent behavior \u2014 turn/addressee management \u2014 with a clean accuracy metric. A group-chat agent eval suite needs an addressee-recognition component, and this provides a ready measurement.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2501.16643",
          "note": "Confirmed real (arXiv 2501.16643, Inoue et al., IWSDS 2025); description accurate \u2014 addressee recognition in triadic multi-modal dialogue, GPT-4o only marginally above chance."
        }
      },
      {
        "name": "MultiAgentBench (MARBLE)",
        "kind": "benchmark",
        "year": "2025",
        "url": "https://arxiv.org/abs/2503.01935",
        "whatItIs": "A benchmark suite (Zhu et al.) for LLM multi-agent systems across cooperative (research collab, Minecraft build, DB diagnosis, coding) and competitive (bargaining, Werewolf social deduction) scenarios.",
        "evalMethod": "Milestone-based KPIs: agent j's KPI = n_j/M (milestones it contributed to over M total), averaged across agents. Quality dimensions scored 0-5 by LLM evaluation: Communication Score, Planning Score, and Coordination Score (mean of the two); competition scenarios use process-level win/loss scaled 0-100.",
        "relevance": "Provides reusable group-interaction grading machinery \u2014 milestone attribution plus LLM-judged communication/coordination scores \u2014 for evaluating how an agent behaves among multiple other agents. The Werewolf/bargaining scenarios are multi-party social settings adjacent to group chat.",
        "_verify": {
          "real": true,
          "accurate": true,
          "correctedUrl": "https://arxiv.org/abs/2503.01935",
          "note": "Confirmed real (Zhu et al., ACL 2025 Main, MARBLE repo). KPI=(1/NM)Sum(n_j) and 5-point Communication/Planning scores with Coordination=mean of the two match exactly; competition uses scenario win/loss (the 0-100 scaling is directionally right, not a single stated formula)."
        }
      }
    ]
  }
]