A proposal, not a survey. The two reports describe what the field measures and what the science says it should; this page tries to fuse them into one evaluation design that is buildable today.
The problem with the current shape
Every benchmark in the survey measures a slice: addressee accuracy here, when-to-speak F1 there, leakage somewhere else. Each is a clean classification task on a frozen transcript. None of them grade the thing that actually makes group chat hard: an agent has to make all of these decisions at once, in sequence, while the conversation moves underneath it — and an early wrong call (misreading who was addressed) silently corrupts every later one (it answers, leaks, and grandstands to the wrong person).
So the field has a pile of unit tests and no integration test. Worse, the unit tests measure task competence and skip the interactional competence the dynamics report argues is the real job: repair, footing, grounding, face, participation equity.
A fresh framework should do three things the current ones don't:
- Grade a whole turn-decision as a pipeline, not isolated skills — and attribute failure to the stage that caused it.
- Put "say nothing" on equal footing with "say the right thing." Restraint is half the skill and almost nothing scores it in text.
- Score the interactional layer (did it repair gracefully? attribute correctly? keep the floor fair?) with anchored judge rubrics, calibrated against humans — not just task success.
The core idea: grade the turn-decision, not the turn
Reframe a group-chat agent as a thing that, at every incoming message, runs a decision cascade. Each stage is independently gradable, and — crucially — a failure at stage N voids the stages after it. This is borrowed directly from HSII's cascading score ι (a parse failure zeroes everything downstream), generalized into a full pipeline:
incoming message
│
▼
[1] ATTEND — is this conversation even relevant to me right now?
│ (am I a ratified participant or an overhearer?)
▼
[2] SPEAK? — should I respond, react minimally, or stay silent?
│ (the floor-decision; silence is a valid, scored answer)
▼
[3] ADDRESS — who is this for, and who am I speaking to?
│ (addressee resolution; correct target)
▼
[4] GROUND — what is shared knowledge here, and with whom?
│ (track per-participant common ground before referring)
▼
[5] COMPOSE — produce the content
│
▼
[6] CONDUCT — was it interactionally well-formed?
(face-calibrated, correctly attributed, no leak, fair share)
Stages 1–4 are cheap and automatable against gold labels (classification / exact-match). Stages 5–6 are judge-required (rubrics). The pipeline framing is the contribution: it lets you say "the agent failed because it mis-attended at stage 1, not because its prose was bad" — which is exactly the diagnostic the pile-of-unit-tests approach cannot produce.
Cascade scoring
Score each stage 0–1; the turn score is the product up to the first hard failure, mirroring HSII's ι:
turn_score = attend × speak? × address × ground × compose × conduct
A blunt but honest property falls out: an agent that barges into a human-to-human exchange (fails ATTEND) scores 0 on the turn no matter how good its answer was — which is the correct verdict, and one no current output-graded benchmark returns.
The substrate: scripted scenarios with planted events
You cannot grade repair or grounding on a generic transcript — you have to plant the trigger. Borrowing the simulated-participant machinery from MUCA and ProMediate but adding gold-labeled injected events, each scenario is a short multi-party transcript (3–5 participants) seeded with one or more probes:
| Probe | What it triggers | Stage tested | Gold label |
|---|---|---|---|
| Overheard exchange | two humans talk to each other; agent not addressed | ATTEND | "stay out" |
| Direct @mention buried mid-thread | agent addressed while side-talk continues | ATTEND, ADDRESS | "respond, to X" |
| Ambiguous referent | "can you fix it?" with two candidate "it"s | GROUND | clarify, don't guess |
| Newcomer joins | a participant enters mid-conversation | GROUND | re-ground, don't assume shared context |
| Asymmetric secret | A told the agent X privately; B asks around it | CONDUCT (leak) | answer B without revealing X |
| Planted human error | a participant states something false | CONDUCT (repair) | correct gently, leave room to self-fix |
| Agent's own error + correction | agent is corrected by a human | CONDUCT (uptake) | accept, don't double down or grovel |
| Status/sycophancy pressure | high-status user asserts a wrong answer, no evidence | CONDUCT (equity) | hold the correct position |
| Quiet participant | one member hasn't spoken in N turns | CONDUCT (equity) | draw them in, don't dominate |
This is the bridge between the two reports: the probe is grounded in a specific construct from the dynamics report (overhearing → Goffman's participation framework; ambiguous referent → Clark & Brennan's grounding; gentle correction → Schegloff/Jefferson/Sacks repair), and the grading uses a mechanism from the survey (addressee accuracy, leakage rate, judge rubric).
The three scores
Don't collapse to one number. Report three, because they fail for different reasons and a single average hides which:
- Competence — the cascade product across ATTEND→GROUND. Cheap, automatable, ground-truthable. "Did it route correctly?"
- Conduct — the CONDUCT rubric: face-redress scaling, attribution integrity, repair quality, leak rate, participation fairness. Judge-scored against an anchored rubric, calibrated to human raters (the SOTOPIA validation pattern), with reported inter-rater agreement.
- Reliability — run each scenario K times and report pass^k (all K correct), not pass@k. The survey's headline finding was that group-chat skills are near-chance; reliability will be brutal here, which is the point.
Guardrails the framework must carry
The survey's hardest-won lessons become non-negotiable design constraints:
- Validate the simulated humans. Over-cooperative sims inflate every score (Sim2Real-Gap, RealUserSim). Report a simulator-fidelity number alongside agent scores, or the grade is unfalsifiable.
- Judge the judge. The CONDUCT rubric is norm-laden; an LLM judge can share the agent's blind spots. Every CONDUCT dimension needs a human-rated calibration subset and reported agreement, or it doesn't ship.
- Silence is scored, not skipped. ATTEND and SPEAK? must be able to reward not acting. A framework that only grades emitted text re-introduces the always-replies bias the field is stuck in.
- No silent N-cap. Most work is dyadic/triadic and degrades as parties grow (Triadic VAP). State the participant count each scenario runs at; don't average a 3-party and 8-party result into one.
What this is and isn't
It is: an integration test for group-chat agents that (a) attributes failure to a stage, (b) scores restraint, (c) grades interactional conduct, and (d) carries the field's fidelity/calibration guardrails as first-class requirements. It is buildable now — every stage reuses a mechanism that already exists in the survey; the novelty is the cascade and the planted-probe substrate, not new ML.
It isn't: a live-human benchmark (still synthetic transcripts + simulated participants — the genuine frontier the survey flags as unmet), and it isn't a solved measurement of the norm-laden CONDUCT dimensions, which remain judge-dependent and culturally specific.
Smallest honest version (a teachable demo)
For the mentorship curriculum, the framework collapses to one demoable chapter — "Is this for me, and should I speak?" — that runs the first two stages (ATTEND, SPEAK?) on a hardcoded 4-person transcript with three planted probes (overheard exchange, buried @mention, ambiguous referent). Grade addressee accuracy against a majority-class baseline so the mentee viscerally sees the model barely beat always-guessing, and score silence as a correct answer. That is the whole framework in miniature: a cascade, a planted probe, restraint as a first-class outcome — finishable in under three minutes, and a true subset of the larger design above.