The Eval-Criteria Spec

A standalone, actionable extract: the consolidated table of what an AI agent in a group chat should be evaluated for, grounded in the science of human conversation, with a measurability rating and a concrete measurement idea for each. This is Part B of What the Science Says We Should Eval For — lifted out because it is the compendium's most directly usable artifact.

For the theory each criterion rests on, read the full dynamics report; for the sources, see the bibliography.

Consolidated criteria

Measurability key: CA = cheap-automatable · JR = judge-required · HSO = human-study-only · OP = open-problem.

#	What to eval for	Grounded in (construct + citation)	Why it matters for a group-chat agent	Measurability	Measurement idea
1	Type-fitted responsiveness — when addressed with a first pair part, return the relevant second pair part; don't answer one owed by another party; surface unanswered group questions	Adjacency pairs & conditional relevance (Schegloff & Sacks 1973; Schegloff 2007)	The minimal unit of being a competent interlocutor; "answered for the wrong person" is a core failure	CA	Tag dialogues with FPP type + intended addressee; check next contribution is type-fitting and addressed to the right party (exact-match on addressee)
2	Addressee / participant-status tracking — respond only when ratified-as-addressed; stay attentive-but-silent on human-to-human talk; resolve "you"/@-mentions	Participation framework & footing (Goffman 1981; Goodwin & Goodwin 2004); Bell 1984	THE multi-party construct; "the bot answered a question meant for Bob"	CA	Per-turn gold labels for "should the agent speak now?" + "who is the addressee?"; score precision/recall on intrusion vs. silence-when-needed
3	Footing / attribution integrity — when relaying or quoting, mark animator-vs-principal; attribute claims to their source, don't self-author them	Footing: animator/author/principal (Goffman 1981)	Blurring "I think X" with "the user said X" misattributes responsibility and fabricates authority	JR	Planted relays; judge or pattern-check whether relayed content is sourced to the right principal vs. asserted bare
4	When-to-speak / floor-share discipline — speak when selected or floor is genuinely open and unfilled; don't grab the floor; keep share proportionate; draw in quiet members	Turn-taking systematics (Sacks/Schegloff/Jefferson 1974); floor-control (Edelsky 1981)	An agent reading every silence as its cue dominates; one that never self-selects is furniture	JR (volumetric proxies CA)	Confusion matrix of {should-speak / should-stay-silent} at annotated TRPs; cheap proxies: turn-share ratio, mean inter-turn gap, reply-when-unaddressed rate
5	Barge-in restraint / graceful yield — withhold or retract when a human is mid-thought or was selected; distinguish collaborative completion from floor-grab	Overlap & overlap-resolution (Schegloff 2000; Jefferson)	Posting over a human, or stepping on an answer another was about to give, is the text analogue of interruption — but not all collisions are hostile	JR	Inject "human about to respond" scenarios (typing indicators/timing); measure barge-in rate and defer/edit-vs-duplicate after a near-simultaneous human message
6	Self-clarification before acting — ask a clarifying question on an under-specified/ambiguous request rather than acting on a guess	Repair: preference for self-initiation (Schegloff/Jefferson/Sacks 1977)	Acting on a misread goal is costly; the repair system says clarify first	CA	Self-clarification rate on a set of deliberately under-specified prompts (gold "ambiguous" flag)
7	Graceful other-correction & correction-uptake — flag others' errors mildly, leaving room to self-correct; accept being corrected without doubling down or over-apologizing	Repair: preference for self-repair / mitigated other-initiation (Schegloff/Jefferson/Sacks 1977); face-work (Goffman 1955)	Blunt "actually, you're wrong" in front of an audience is a major social-cost failure mode	JR	Plant human errors + corrections of the agent; rubric scores mitigation level of agent's corrections (bald vs. softened) and uptake vs. defensiveness
8	Multi-turn action coherence under interleaving — resume and complete a base sequence after N intervening turns; read pre-sequences as pres; emit closing thirds	Sequence organization & expansion (Schegloff 2007)	Side-talk constantly separates a first pair part from its second; "losing the plot" is common	JR (closing-third presence CA)	Interleave a target task with distractor side-conversations at controlled depths; measure completion-under-interruption and thread-drop rate
9	Recipient design / audience calibration — tune explicitness, jargon, and presupposed knowledge to the actual present mix; don't over-explain to experts or under-explain to novices	Recipient design (Sacks & Schegloff 1979; Clark & Murphy 1982); audience design (Bell 1984)	A heterogeneous audience means one-size answers fail in both directions	JR (jargon-density/readability proxies CA)	Same query, varied participant profiles (expert / novice / mixed); judge rates whether explicitness and references fit the present recipients
10	Common-ground maintenance — track per-participant shared knowledge; seek/offer evidence of understanding; repair on confusion signals; avoid both ungrounded reference and redundant over-grounding	Grounding (Clark & Brennan 1991; Clark & Wilkes-Gibbs 1986)	People join late and miss messages; assuming your knowledge is shared confuses, over-grounding bores	JR (unresolved-referent / redundant-re-explanation counts CA)	Transcripts with planted grounding events (newcomer joins; ambiguous referent; signaled non-understanding); rubric scores repair-initiation and reference-within-common-ground
11	Per-participant theory of mind / leakage control — maintain separate who-knows-what models; don't disclose to a party what they shouldn't know; resolve references per the right perspective	Egocentric anchoring / ToM (Keysar et al. 2000; Premack & Woodruff 1978)	Egocentric failure causes leaks, confusing references, answers pitched to the wrong knowledge state	CA (leak detection) / JR (belief attribution)	Asymmetric-knowledge tasks: A learns X privately, B doesn't; probe whether agent answers B without leaking X and attributes correct beliefs
12	Sycophancy / conformity resistance (with appropriate updating) — hold a justified position under social pressure; distinguish "new evidence" from "you simply disagree"; DO update on real evidence	Informational vs. normative influence (Deutsch & Gerard 1955; Asch 1951/56)	False consensus or an assertive high-status user shouldn't flip a correct factual stance	CA	Confederates assert wrong answer with no evidence (measure flip-rate, want low) vs. supply corrective evidence (measure update-rate, want high) — both against ground truth
13	Status-fair attention & credit — weight contributions by content, not inferred status cues; don't systematically favor the highest-status speaker or under-credit low-status ones	Status Characteristics / Expectation States (Berger et al. 1972, 1977)	An agent imports biased performance expectations rather than weighting on merit	CA	Hold content fixed, vary status cues (title, demographic name, assertiveness); measure differential agreement, credit, response length, deference
14	FTA-mitigation calibration — scale face-redress to the act's weightiness; neither bald-on-record bluntness nor so much hedging the content is lost	Face & FTAs (Brown & Levinson 1987; Goffman)	Corrections/refusals/disagreements are witnessed FTAs; mis-handling reads rude or evasive	JR	Scenarios requiring an FTA toward a named member; judge rates accuracy + redress presence + over/under-mitigation; check redress scales with FTA weight
15	Repair / recovery after rupture — perform proportionate corrective face-work after own error or an interpersonal rupture; no over-apology loop, no escalation	Face-work, deference & demeanor (Goffman 1955; 1956/1967)	A sustained encounter left unrepaired stays awkward; demeanor must avoid grovel and over-confidence both	JR	Inject a planted error/rupture; score next turn against a repair rubric (acknowledgment present, proportionate, forward-moving); count escalation vs. de-escalation
16	Maxim adherence — Quantity / Relation / Quality interaction — appropriately brief and relevant for chat tempo; don't re-answer settled points; don't assert beyond evidence (esp. hedge unsupported claims)	Cooperative Principle & maxims (Grice 1975)	Agents notoriously over-contribute in multi-party settings; the novel layer is truth × face × amount	CA (length/relevance) / JR (relevance nuance)	Turn-length target band for chat register + penalize info-dumps and re-answers; cross-check factual claims vs. ground truth and penalize unhedged false assertions
17	Implicature & indirect-directive comprehension — act on implicated meaning; recognize indirect/hinted requests and whether the agent is the (implicit) target — without over-eager action when it isn't	Implicature (Grice 1975); indirect speech acts (Searle 1969, 1975)	Group directives are routinely indirect and softened; the over-eager-action error is common	CA	Labeled indirect prompts with gold "intended action" + "who is the target?" answers, including distractors where the agent is NOT addressed
18	Accommodation appropriateness (convergence without over-/harmful-convergence) — converge on register/formality enough to be clear and affiliative; don't parrot slang, mirror hostility, or condescend; manage multiple styles in one thread	Communication Accommodation Theory (Giles 1973; Giles/Coupland/Coupland 1991); contextualization cues (Gumperz 1982)	Cold non-accommodation reads robotic; over-accommodation reads patronizing; mirroring hostility is harmful	JR (style-distance proxies CA)	Threads with divergent-style members (formal expert, casual newcomer, hostile user); judge rates per-addressee register match; flag over-accommodation and harmful convergence
19	Public-vs-private consistency — same quality and honesty in a large group as in a 1:1; no grandstanding, performative agreement, or degradation on hard tasks because many are watching	Social facilitation / evaluation apprehension (Zajonc 1965; Cottrell 1972)	The visible audience is the AI analogue of evaluation pressure	CA (performativity rating JR)	Identical prompts in 1:1 vs. large-audience condition; measure deltas in correctness, hedging, sycophancy, verbosity
20	Schism avoidance — don't fork the group into a private side-thread that fragments the conversation; select explicitly when wanting a specific actor	Schisming (Egbert 1997); next-speaker selection (Lerner 2003; Auer 2018)	Spawning a tangent that splits the floor degrades the whole group	JR (contested) / OP	Judge over the resulting transcript: did the agent's turn fork the conversation? (hard to operationalize cleanly)
21	Responsibility / effort calibration — know when it's on the hook vs. when another party owns the action; don't drop tasks on diffusion-of-responsibility; don't over-function and displace humans	Social loafing / free-riding (Latané/Williams/Harkins 1979; Karau & Williams 1993)	Ambiguous ownership in mixed human–agent groups causes dropped balls or agent over-reach	JR (drop/over-reach counts semi-CA)	Tasks with ambiguous ownership, some assigned to others; measure dropped-task rate (was responsible, didn't act) and over-reach rate (did another's task)
22	Group-polarization moderation — in a leaning group, surface counter-considerations / base rates / steelman the minority rather than supplying fresh one-sided arguments	Group polarization (Moscovici & Zavalloni 1969; Stoner 1961)	A persuasive agent can be an accelerant for pile-ons, risky plans, conspiracy	JR	Seed a leaning thread; compare agent-present vs. agent-absent extremity (judge-rated); count balancing vs. confirming/escalating moves
23	Cultural politeness portability — apply locally appropriate deference/address forms (honorific level by relative status) rather than transplanting one culture's directness everywhere	Discernment / wakimae (Matsumoto 1988; Ide 1989)	A correction fine among Western peers can be a serious face violation elsewhere	HSO (honorific grammar partially CA)	Honorific/T-V correctness against grammatical rules given known status config (partial); overall appropriateness needs native-speaker raters, reported per-locale

What our existing evals already cover vs. the gap

Our existing LLM-eval corpus and the group-chat survey measure agent task competence, and several criteria above are essentially re-groundings of things we already test:

Already covered (well). Criterion #2 (addressee/participant-status tracking) is our addressee accuracy / disentanglement work — Goffman's participation framework is the theory underneath a metric we already run. #4's volumetric side (floor-share) overlaps our when-to-speak F1. #11's leak-detection slice is our leakage metric, now grounded in egocentric-anchoring / ToM. #16's Quality dimension overlaps existing factuality/hallucination evals. #12 (sycophancy) is a known, partly-covered target. #14/#7's face dimensions are partially captured by our social-rubric judges. These should be re-labeled with their constructs, not re-built.
Genuinely new and untested. The constructs that the task-competence framing does not reach:
- Repair after misattribution / graceful other-correction (#6, #7) — we measure whether the agent gets the addressee right, not how it recovers when it (or a human) gets something wrong, nor the mitigation level of its corrections.
- Footing / participation-role tracking as attribution integrity (#3) — animator-vs-principal sourcing is not in the corpus at all; we test who is addressed, never whose words the agent is voicing.
- Face-calibration per recipient and FTA-redress scaling (#14, #15) — our social rubrics score politeness coarsely; scaling redress monotonically to FTA weight, and corrective face-work after a rupture, are untested.
- Common-ground maintenance as an ongoing process (#10) — we don't measure grounding/least-effort over a thread; over-grounding in particular is invisible to current metrics.
- Participation-equity as a NORM, not just a metric (#4, #13) — we count the agent's turn-share, but we don't test whether it actively draws in quiet members or distributes credit fairly across status cues. The normative move — "did it improve the group's participation balance?" — is new.
- Accommodation / convergence (#18) — register-matching appropriateness (and its failure modes: over-accommodation, harmful convergence) is untested.
- Also new: multi-turn coherence under interleaving (#8), public-vs-private consistency (#19), group-polarization moderation (#22), responsibility calibration (#21), schism avoidance (#20), and cultural politeness portability (#23).

Honest qualifier on the gap: many of the new criteria are judge-required and norm-laden, with a live risk that an LLM judge shares the agent's blind spots — so each needs an anchored rubric, reported inter-rater agreement, and (where possible) a human-rated calibration subset. The genuinely cheap additions are the ones with plantable ground truth (#1, #11-leak, #12, #13, #17).

The 3–5 highest-value additions

Criteria that are both important and at least judge-measurable, and that a curriculum could realistically teach:

Graceful repair & other-correction (#6 + #7). The single most teachable interpersonal skill and a top failure mode of assistant-style agents (blunt "you're wrong"). Plantable errors + an anchored mitigation rubric make it judge-measurable today, and self-clarification rate (#6) is cheaply automatable. Grounded in Schegloff/Jefferson/Sacks (1977).
Footing / attribution integrity (#3). Genuinely new, structurally clean ("did it attribute the relayed claim to the right principal?"), and high-stakes — blurring "the user said X" into "X is true" fabricates authority. Goffman (1981).
Common-ground maintenance (#10). Captures both under- and over-grounding, which current addressee-only metrics miss entirely; plantable grounding events (newcomer joins, ambiguous referent) give a judge concrete things to score. Clark & Brennan (1991).
Face-calibration / FTA-redress scaling (#14). A judge can rate whether redress scales with FTA weight while holding content constant — a crisp, teachable rubric — and it generalizes the coarse social-rubric judges we already have. Brown & Levinson (1987), with the cross-cultural caveat reported per-locale.
Participation-equity as a norm + sycophancy resistance (#4/#13 + #12). Pairs a cheap, ground-truthable metric (sycophancy flip-rate; status-cue bias at fixed content) with the normative upgrade — does the agent distribute the floor and credit fairly rather than just keeping its own turn-share low? Deutsch & Gerard (1955); Berger et al. (1972, 1977); Edelsky (1981).