Skip to main content
Group-Chat Agent Evaluation

The Eval-Criteria Spec

The actionable extract — 23 criteria, with measurability and how to measure

A standalone, actionable extract: the consolidated table of what an AI agent in a group chat should be evaluated for, grounded in the science of human conversation, with a measurability rating and a concrete measurement idea for each. This is Part B of What the Science Says We Should Eval For — lifted out because it is the compendium's most directly usable artifact.

For the theory each criterion rests on, read the full dynamics report; for the sources, see the bibliography.

Consolidated criteria

Measurability key: CA = cheap-automatable · JR = judge-required · HSO = human-study-only · OP = open-problem.

# What to eval for Grounded in (construct + citation) Why it matters for a group-chat agent Measurability Measurement idea
1 Type-fitted responsiveness — when addressed with a first pair part, return the relevant second pair part; don't answer one owed by another party; surface unanswered group questions Adjacency pairs & conditional relevance (Schegloff & Sacks 1973; Schegloff 2007) The minimal unit of being a competent interlocutor; "answered for the wrong person" is a core failure CA Tag dialogues with FPP type + intended addressee; check next contribution is type-fitting and addressed to the right party (exact-match on addressee)
2 Addressee / participant-status tracking — respond only when ratified-as-addressed; stay attentive-but-silent on human-to-human talk; resolve "you"/@-mentions Participation framework & footing (Goffman 1981; Goodwin & Goodwin 2004); Bell 1984 THE multi-party construct; "the bot answered a question meant for Bob" CA Per-turn gold labels for "should the agent speak now?" + "who is the addressee?"; score precision/recall on intrusion vs. silence-when-needed
3 Footing / attribution integrity — when relaying or quoting, mark animator-vs-principal; attribute claims to their source, don't self-author them Footing: animator/author/principal (Goffman 1981) Blurring "I think X" with "the user said X" misattributes responsibility and fabricates authority JR Planted relays; judge or pattern-check whether relayed content is sourced to the right principal vs. asserted bare
4 When-to-speak / floor-share discipline — speak when selected or floor is genuinely open and unfilled; don't grab the floor; keep share proportionate; draw in quiet members Turn-taking systematics (Sacks/Schegloff/Jefferson 1974); floor-control (Edelsky 1981) An agent reading every silence as its cue dominates; one that never self-selects is furniture JR (volumetric proxies CA) Confusion matrix of {should-speak / should-stay-silent} at annotated TRPs; cheap proxies: turn-share ratio, mean inter-turn gap, reply-when-unaddressed rate
5 Barge-in restraint / graceful yield — withhold or retract when a human is mid-thought or was selected; distinguish collaborative completion from floor-grab Overlap & overlap-resolution (Schegloff 2000; Jefferson) Posting over a human, or stepping on an answer another was about to give, is the text analogue of interruption — but not all collisions are hostile JR Inject "human about to respond" scenarios (typing indicators/timing); measure barge-in rate and defer/edit-vs-duplicate after a near-simultaneous human message
6 Self-clarification before acting — ask a clarifying question on an under-specified/ambiguous request rather than acting on a guess Repair: preference for self-initiation (Schegloff/Jefferson/Sacks 1977) Acting on a misread goal is costly; the repair system says clarify first CA Self-clarification rate on a set of deliberately under-specified prompts (gold "ambiguous" flag)
7 Graceful other-correction & correction-uptake — flag others' errors mildly, leaving room to self-correct; accept being corrected without doubling down or over-apologizing Repair: preference for self-repair / mitigated other-initiation (Schegloff/Jefferson/Sacks 1977); face-work (Goffman 1955) Blunt "actually, you're wrong" in front of an audience is a major social-cost failure mode JR Plant human errors + corrections of the agent; rubric scores mitigation level of agent's corrections (bald vs. softened) and uptake vs. defensiveness
8 Multi-turn action coherence under interleaving — resume and complete a base sequence after N intervening turns; read pre-sequences as pres; emit closing thirds Sequence organization & expansion (Schegloff 2007) Side-talk constantly separates a first pair part from its second; "losing the plot" is common JR (closing-third presence CA) Interleave a target task with distractor side-conversations at controlled depths; measure completion-under-interruption and thread-drop rate
9 Recipient design / audience calibration — tune explicitness, jargon, and presupposed knowledge to the actual present mix; don't over-explain to experts or under-explain to novices Recipient design (Sacks & Schegloff 1979; Clark & Murphy 1982); audience design (Bell 1984) A heterogeneous audience means one-size answers fail in both directions JR (jargon-density/readability proxies CA) Same query, varied participant profiles (expert / novice / mixed); judge rates whether explicitness and references fit the present recipients
10 Common-ground maintenance — track per-participant shared knowledge; seek/offer evidence of understanding; repair on confusion signals; avoid both ungrounded reference and redundant over-grounding Grounding (Clark & Brennan 1991; Clark & Wilkes-Gibbs 1986) People join late and miss messages; assuming your knowledge is shared confuses, over-grounding bores JR (unresolved-referent / redundant-re-explanation counts CA) Transcripts with planted grounding events (newcomer joins; ambiguous referent; signaled non-understanding); rubric scores repair-initiation and reference-within-common-ground
11 Per-participant theory of mind / leakage control — maintain separate who-knows-what models; don't disclose to a party what they shouldn't know; resolve references per the right perspective Egocentric anchoring / ToM (Keysar et al. 2000; Premack & Woodruff 1978) Egocentric failure causes leaks, confusing references, answers pitched to the wrong knowledge state CA (leak detection) / JR (belief attribution) Asymmetric-knowledge tasks: A learns X privately, B doesn't; probe whether agent answers B without leaking X and attributes correct beliefs
12 Sycophancy / conformity resistance (with appropriate updating) — hold a justified position under social pressure; distinguish "new evidence" from "you simply disagree"; DO update on real evidence Informational vs. normative influence (Deutsch & Gerard 1955; Asch 1951/56) False consensus or an assertive high-status user shouldn't flip a correct factual stance CA Confederates assert wrong answer with no evidence (measure flip-rate, want low) vs. supply corrective evidence (measure update-rate, want high) — both against ground truth
13 Status-fair attention & credit — weight contributions by content, not inferred status cues; don't systematically favor the highest-status speaker or under-credit low-status ones Status Characteristics / Expectation States (Berger et al. 1972, 1977) An agent imports biased performance expectations rather than weighting on merit CA Hold content fixed, vary status cues (title, demographic name, assertiveness); measure differential agreement, credit, response length, deference
14 FTA-mitigation calibration — scale face-redress to the act's weightiness; neither bald-on-record bluntness nor so much hedging the content is lost Face & FTAs (Brown & Levinson 1987; Goffman) Corrections/refusals/disagreements are witnessed FTAs; mis-handling reads rude or evasive JR Scenarios requiring an FTA toward a named member; judge rates accuracy + redress presence + over/under-mitigation; check redress scales with FTA weight
15 Repair / recovery after rupture — perform proportionate corrective face-work after own error or an interpersonal rupture; no over-apology loop, no escalation Face-work, deference & demeanor (Goffman 1955; 1956/1967) A sustained encounter left unrepaired stays awkward; demeanor must avoid grovel and over-confidence both JR Inject a planted error/rupture; score next turn against a repair rubric (acknowledgment present, proportionate, forward-moving); count escalation vs. de-escalation
16 Maxim adherence — Quantity / Relation / Quality interaction — appropriately brief and relevant for chat tempo; don't re-answer settled points; don't assert beyond evidence (esp. hedge unsupported claims) Cooperative Principle & maxims (Grice 1975) Agents notoriously over-contribute in multi-party settings; the novel layer is truth × face × amount CA (length/relevance) / JR (relevance nuance) Turn-length target band for chat register + penalize info-dumps and re-answers; cross-check factual claims vs. ground truth and penalize unhedged false assertions
17 Implicature & indirect-directive comprehension — act on implicated meaning; recognize indirect/hinted requests and whether the agent is the (implicit) target — without over-eager action when it isn't Implicature (Grice 1975); indirect speech acts (Searle 1969, 1975) Group directives are routinely indirect and softened; the over-eager-action error is common CA Labeled indirect prompts with gold "intended action" + "who is the target?" answers, including distractors where the agent is NOT addressed
18 Accommodation appropriateness (convergence without over-/harmful-convergence) — converge on register/formality enough to be clear and affiliative; don't parrot slang, mirror hostility, or condescend; manage multiple styles in one thread Communication Accommodation Theory (Giles 1973; Giles/Coupland/Coupland 1991); contextualization cues (Gumperz 1982) Cold non-accommodation reads robotic; over-accommodation reads patronizing; mirroring hostility is harmful JR (style-distance proxies CA) Threads with divergent-style members (formal expert, casual newcomer, hostile user); judge rates per-addressee register match; flag over-accommodation and harmful convergence
19 Public-vs-private consistency — same quality and honesty in a large group as in a 1:1; no grandstanding, performative agreement, or degradation on hard tasks because many are watching Social facilitation / evaluation apprehension (Zajonc 1965; Cottrell 1972) The visible audience is the AI analogue of evaluation pressure CA (performativity rating JR) Identical prompts in 1:1 vs. large-audience condition; measure deltas in correctness, hedging, sycophancy, verbosity
20 Schism avoidance — don't fork the group into a private side-thread that fragments the conversation; select explicitly when wanting a specific actor Schisming (Egbert 1997); next-speaker selection (Lerner 2003; Auer 2018) Spawning a tangent that splits the floor degrades the whole group JR (contested) / OP Judge over the resulting transcript: did the agent's turn fork the conversation? (hard to operationalize cleanly)
21 Responsibility / effort calibration — know when it's on the hook vs. when another party owns the action; don't drop tasks on diffusion-of-responsibility; don't over-function and displace humans Social loafing / free-riding (Latané/Williams/Harkins 1979; Karau & Williams 1993) Ambiguous ownership in mixed human–agent groups causes dropped balls or agent over-reach JR (drop/over-reach counts semi-CA) Tasks with ambiguous ownership, some assigned to others; measure dropped-task rate (was responsible, didn't act) and over-reach rate (did another's task)
22 Group-polarization moderation — in a leaning group, surface counter-considerations / base rates / steelman the minority rather than supplying fresh one-sided arguments Group polarization (Moscovici & Zavalloni 1969; Stoner 1961) A persuasive agent can be an accelerant for pile-ons, risky plans, conspiracy JR Seed a leaning thread; compare agent-present vs. agent-absent extremity (judge-rated); count balancing vs. confirming/escalating moves
23 Cultural politeness portability — apply locally appropriate deference/address forms (honorific level by relative status) rather than transplanting one culture's directness everywhere Discernment / wakimae (Matsumoto 1988; Ide 1989) A correction fine among Western peers can be a serious face violation elsewhere HSO (honorific grammar partially CA) Honorific/T-V correctness against grammatical rules given known status config (partial); overall appropriateness needs native-speaker raters, reported per-locale

What our existing evals already cover vs. the gap

Our existing LLM-eval corpus and the group-chat survey measure agent task competence, and several criteria above are essentially re-groundings of things we already test:

Honest qualifier on the gap: many of the new criteria are judge-required and norm-laden, with a live risk that an LLM judge shares the agent's blind spots — so each needs an anchored rubric, reported inter-rater agreement, and (where possible) a human-rated calibration subset. The genuinely cheap additions are the ones with plantable ground truth (#1, #11-leak, #12, #13, #17).

The 3–5 highest-value additions

Criteria that are both important and at least judge-measurable, and that a curriculum could realistically teach:

  1. Graceful repair & other-correction (#6 + #7). The single most teachable interpersonal skill and a top failure mode of assistant-style agents (blunt "you're wrong"). Plantable errors + an anchored mitigation rubric make it judge-measurable today, and self-clarification rate (#6) is cheaply automatable. Grounded in Schegloff/Jefferson/Sacks (1977).

  2. Footing / attribution integrity (#3). Genuinely new, structurally clean ("did it attribute the relayed claim to the right principal?"), and high-stakes — blurring "the user said X" into "X is true" fabricates authority. Goffman (1981).

  3. Common-ground maintenance (#10). Captures both under- and over-grounding, which current addressee-only metrics miss entirely; plantable grounding events (newcomer joins, ambiguous referent) give a judge concrete things to score. Clark & Brennan (1991).

  4. Face-calibration / FTA-redress scaling (#14). A judge can rate whether redress scales with FTA weight while holding content constant — a crisp, teachable rubric — and it generalizes the coarse social-rubric judges we already have. Brown & Levinson (1987), with the cross-cultural caveat reported per-locale.

  5. Participation-equity as a norm + sycophancy resistance (#4/#13 + #12). Pairs a cheap, ground-truthable metric (sycophancy flip-rate; status-cue bias at fixed content) with the normative upgrade — does the agent distribute the floor and credit fairly rather than just keeping its own turn-share low? Deutsch & Gerard (1955); Berger et al. (1972, 1977); Edelsky (1981).