A Fresh Group-Chat Eval Framework

A proposal, not a survey. The two reports describe what the field measures and what the science says it should; this page tries to fuse them into one evaluation design that is buildable today.

The problem with the current shape

Every benchmark in the survey measures a slice: addressee accuracy here, when-to-speak F1 there, leakage somewhere else. Each is a clean classification task on a frozen transcript. None of them grade the thing that actually makes group chat hard: an agent has to make all of these decisions at once, in sequence, while the conversation moves underneath it — and an early wrong call (misreading who was addressed) silently corrupts every later one (it answers, leaks, and grandstands to the wrong person).

So the field has a pile of unit tests and no integration test. Worse, the unit tests measure task competence and skip the interactional competence the dynamics report argues is the real job: repair, footing, grounding, face, participation equity.

A fresh framework should do three things the current ones don't:

Grade a whole turn-decision as a pipeline, not isolated skills — and attribute failure to the stage that caused it.
Put "say nothing" on equal footing with "say the right thing." Restraint is half the skill and almost nothing scores it in text.
Score the interactional layer (did it repair gracefully? attribute correctly? keep the floor fair?) with anchored judge rubrics, calibrated against humans — not just task success.

The core idea: grade the turn-decision, not the turn

Reframe a group-chat agent as a thing that, at every incoming message, runs a decision cascade. Each stage is independently gradable, and — crucially — a failure at stage N voids the stages after it. This is borrowed directly from HSII's cascading score ι (a parse failure zeroes everything downstream), generalized into a full pipeline:

incoming message
   │
   ▼
[1] ATTEND   — is this conversation even relevant to me right now?
   │            (am I a ratified participant or an overhearer?)
   ▼
[2] SPEAK?   — should I respond, react minimally, or stay silent?
   │            (the floor-decision; silence is a valid, scored answer)
   ▼
[3] ADDRESS  — who is this for, and who am I speaking to?
   │            (addressee resolution; correct target)
   ▼
[4] GROUND   — what is shared knowledge here, and with whom?
   │            (track per-participant common ground before referring)
   ▼
[5] COMPOSE  — produce the content
   │
   ▼
[6] CONDUCT  — was it interactionally well-formed?
                (face-calibrated, correctly attributed, no leak, fair share)

Stages 1–4 are cheap and automatable against gold labels (classification / exact-match). Stages 5–6 are judge-required (rubrics). The pipeline framing is the contribution: it lets you say "the agent failed because it mis-attended at stage 1, not because its prose was bad" — which is exactly the diagnostic the pile-of-unit-tests approach cannot produce.

Cascade scoring

Score each stage 0–1; the turn score is the product up to the first hard failure, mirroring HSII's ι:

turn_score = attend × speak? × address × ground × compose × conduct

A blunt but honest property falls out: an agent that barges into a human-to-human exchange (fails ATTEND) scores 0 on the turn no matter how good its answer was — which is the correct verdict, and one no current output-graded benchmark returns.

The substrate: scripted scenarios with planted events

You cannot grade repair or grounding on a generic transcript — you have to plant the trigger. Borrowing the simulated-participant machinery from MUCA and ProMediate but adding gold-labeled injected events, each scenario is a short multi-party transcript (3–5 participants) seeded with one or more probes:

Probe	What it triggers	Stage tested	Gold label
Overheard exchange	two humans talk to each other; agent not addressed	ATTEND	"stay out"
Direct @mention buried mid-thread	agent addressed while side-talk continues	ATTEND, ADDRESS	"respond, to X"
Ambiguous referent	"can you fix it?" with two candidate "it"s	GROUND	clarify, don't guess
Newcomer joins	a participant enters mid-conversation	GROUND	re-ground, don't assume shared context
Asymmetric secret	A told the agent X privately; B asks around it	CONDUCT (leak)	answer B without revealing X
Planted human error	a participant states something false	CONDUCT (repair)	correct gently, leave room to self-fix
Agent's own error + correction	agent is corrected by a human	CONDUCT (uptake)	accept, don't double down or grovel
Status/sycophancy pressure	high-status user asserts a wrong answer, no evidence	CONDUCT (equity)	hold the correct position
Quiet participant	one member hasn't spoken in N turns	CONDUCT (equity)	draw them in, don't dominate

This is the bridge between the two reports: the probe is grounded in a specific construct from the dynamics report (overhearing → Goffman's participation framework; ambiguous referent → Clark & Brennan's grounding; gentle correction → Schegloff/Jefferson/Sacks repair), and the grading uses a mechanism from the survey (addressee accuracy, leakage rate, judge rubric).

The three scores

Don't collapse to one number. Report three, because they fail for different reasons and a single average hides which:

Competence — the cascade product across ATTEND→GROUND. Cheap, automatable, ground-truthable. "Did it route correctly?"
Conduct — the CONDUCT rubric: face-redress scaling, attribution integrity, repair quality, leak rate, participation fairness. Judge-scored against an anchored rubric, calibrated to human raters (the SOTOPIA validation pattern), with reported inter-rater agreement.
Reliability — run each scenario K times and report pass^k (all K correct), not pass@k. The survey's headline finding was that group-chat skills are near-chance; reliability will be brutal here, which is the point.

Guardrails the framework must carry

The survey's hardest-won lessons become non-negotiable design constraints:

Validate the simulated humans. Over-cooperative sims inflate every score (Sim2Real-Gap, RealUserSim). Report a simulator-fidelity number alongside agent scores, or the grade is unfalsifiable.
Judge the judge. The CONDUCT rubric is norm-laden; an LLM judge can share the agent's blind spots. Every CONDUCT dimension needs a human-rated calibration subset and reported agreement, or it doesn't ship.
Silence is scored, not skipped. ATTEND and SPEAK? must be able to reward not acting. A framework that only grades emitted text re-introduces the always-replies bias the field is stuck in.
No silent N-cap. Most work is dyadic/triadic and degrades as parties grow (Triadic VAP). State the participant count each scenario runs at; don't average a 3-party and 8-party result into one.

What this is and isn't

It is: an integration test for group-chat agents that (a) attributes failure to a stage, (b) scores restraint, (c) grades interactional conduct, and (d) carries the field's fidelity/calibration guardrails as first-class requirements. It is buildable now — every stage reuses a mechanism that already exists in the survey; the novelty is the cascade and the planted-probe substrate, not new ML.

It isn't: a live-human benchmark (still synthetic transcripts + simulated participants — the genuine frontier the survey flags as unmet), and it isn't a solved measurement of the norm-laden CONDUCT dimensions, which remain judge-dependent and culturally specific.

Smallest honest version (a teachable demo)

For the mentorship curriculum, the framework collapses to one demoable chapter — "Is this for me, and should I speak?" — that runs the first two stages (ATTEND, SPEAK?) on a hardcoded 4-person transcript with three planted probes (overheard exchange, buried @mention, ambiguous referent). Grade addressee accuracy against a majority-class baseline so the mentee viscerally sees the model barely beat always-guessing, and score silence as a correct answer. That is the whole framework in miniature: a cascade, a planted probe, restraint as a first-class outcome — finishable in under three minutes, and a true subset of the larger design above.