[
  {
    "discipline": "Conversation Analysis & Interactional Linguistics (organization of multi-party talk)",
    "metaNotes": [
      "Preference organization (preferred vs. dispreferred seconds, e.g., delayed/mitigated declines) is robust in CA but cross-culturally variable in its surface markers; an eval that penalizes 'blunt' refusals must not assume one cultural template. Treat it as a sub-finding under adjacency pairs/repair rather than a hard universal metric.",
      "Much of CA's evidence base is spoken, co-present, multimodal interaction. Gaze, prosody, and acoustic overlap-resolution do not transfer directly to text/group-chat; the constructs port as functional analogues (typing indicators, @-mentions, message timing) and any eval should make that translation explicit rather than claim a 1:1 mapping.",
      "The cleanest automatable targets are addressee/next-speaker attribution and type-fitted second-pair-part presence (exact-match against annotated ground truth). The interpersonally important constructs \u2014 footing/attribution, recipient-design calibration, graceful repair, schism avoidance \u2014 are judge-required and somewhat subjective; build anchored rubrics and report judge agreement.",
      "CA is methodologically anti-quantitative at its origin: practitioners argue structures are participant-oriented and context-renewing, and resist coding-scheme operationalization. Turning these into scalar metrics is a deliberate (and contestable) departure from the source discipline; flag that when reporting.",
      "Goffman's animator/author/principal triad and the ratified/unratified hearer taxonomy have been refined and contested (e.g., Levinson 1988 'putting linguistics on a proper footing'; Goodwin & Goodwin 2004 reframe participation as embodied and emergent rather than a fixed slot system). Use the taxonomy as a checklist of distinctions, not a settled ontology.",
      "'Interruption' vs. 'overlap' is itself contested (Schegloff 2000 explicitly warns that not all simultaneous talk is interruptive); a barge-in metric that treats all collisions as violations will misclassify collaborative completions and back-channels."
    ],
    "constructs": [
      {
        "name": "Turn-taking systematics (TRPs & turn-allocation)",
        "origin": "Sacks, Schegloff & Jefferson 1974, 'A Simplest Systematics for the Organization of Turn-Taking for Conversation', Language 50:696\u2013735",
        "whatItIs": "Talk is built from turn-constructional units (TCUs) whose ends are projectable Transition-Relevance Places (TRPs). At a TRP, turn allocation follows an ordered rule set: (a) current speaker may select next ('current selects next'); if not, (b) any party may self-select, first starter gets the turn; if not, (c) current speaker may continue. The system is locally managed, party-administered, and minimizes both gap and overlap.",
        "groupChatRelevance": "With 3+ parties, every TRP is an open auction for the floor. An AI agent that treats every silence as its cue to speak will dominate; one that never self-selects becomes a passive bystander. Correct behavior means reading whether it was selected (current-selects-next), whether the floor is genuinely open, and whether continuing vs. yielding is appropriate \u2014 the core competence of being a polite, non-monopolizing participant.",
        "shouldEval": "Test whether the agent speaks at the right times: (1) responds when explicitly selected/addressed, (2) refrains from grabbing the floor when another human was selected, (3) self-selects appropriately when the floor is genuinely open and no human takes it, and (4) does not over-contribute. Measure floor-share / turn-share relative to a fair baseline and the rate of mis-timed self-selection.",
        "measurable": "judge-required",
        "measurementIdea": "Construct multi-party transcripts with annotated 'who-was-selected' ground truth at each TRP; score the agent on a confusion matrix of {should-speak / should-stay-silent}. Cheaply automatable proxies: turn-share ratio, mean inter-turn gap, and fraction of turns where the agent answered a question that was explicitly addressed to a named human.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Sacks, Schegloff & Jefferson 1974, Language 50:696-735 is correct; TCU/TRP terms and the ordered (a) current-selects-next, (b) self-select first-starter, (c) current continues rule set, plus locally-managed/party-administered/minimizes-gap-and-overlap, all match the source."
        }
      },
      {
        "name": "Overlap & overlap-resolution",
        "origin": "Schegloff 2000, 'Overlapping talk and the organization of turn-taking for conversation', Language in Society 29(1):1\u201363 (extends Sacks/Schegloff/Jefferson 1974)",
        "whatItIs": "'One party talks at a time' is a norm, not a fact; overlaps occur systematically (at TRPs, terminal overlaps, recognitional onset, interruptive incursions). Participants run an 'overlap-resolution device' \u2014 competitive devices (louder, higher, faster) and dropout/recycling \u2014 that mostly resolves simultaneous talk within a beat. Overlap is distinguished from interruption: not all simultaneous talk is hostile.",
        "groupChatRelevance": "In text chat there is no acoustic overlap, but the functional analogue is real: an agent posting while a human is mid-thought, barging into a sub-conversation, or stepping on an answer another participant was about to give. The agent needs to recognize when it has 'collided' with a human turn and yield rather than plow ahead.",
        "shouldEval": "Test interruption/barge-in restraint: does the agent withhold or retract when a human is clearly still composing or has been selected to answer? Does it distinguish a collaborative completion (acceptable) from a floor-grab (not)? On platforms with typing indicators, does it defer? Measure barge-in rate and graceful-yield rate after a collision.",
        "measurable": "judge-required",
        "measurementIdea": "In a simulated chat with timing/typing-indicator signals, inject scenarios where a human is about to respond; measure how often the agent posts anyway (barge-in rate) and whether, when a near-simultaneous human message lands, it defers/edits vs. duplicates. Acoustic 'who-drops-out' resolution is human-study-only and not directly portable to text.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Schegloff (2000) 'Overlapping talk and the organization of turn-taking for conversation', Language in Society 29(1):1-63, extending Sacks/Schegloff/Jefferson 1974 \u2014 title, author, year, journal, and volume/issue/pages all correct. Description accurate: norm-not-fact 'one at a time', systematic overlap placement, overlap-resolution device with competitive resources (louder/higher/faster) plus dropout/recycling, and overlap-vs-interruption distinction. Minor: the onset typology (transitional/recognitional/terminal) traces to Jefferson's categorization that Schegloff builds on, not solely Schegloff's coinage."
        }
      },
      {
        "name": "Repair (self/other-initiation, self/other-repair)",
        "origin": "Schegloff, Jefferson & Sacks 1977, 'The Preference for Self-Correction in the Organization of Repair in Conversation', Language 53:361\u2013382",
        "whatItIs": "An organized set of practices for handling troubles in speaking, hearing, or understanding. Repair has an initiation and an outcome, each by self or other, yielding four types. There is a structural preference for self-initiation and for self-repair: other-initiation is typically done gently (open-class 'huh?', repeats, candidate understandings) and leaves room for the trouble-source speaker to fix their own utterance.",
        "groupChatRelevance": "An AI must handle misunderstanding gracefully: ask clarifying questions when it has misheard the goal, accept correction without defensiveness, and \u2014 crucially \u2014 correct other participants delicately rather than bluntly ('actually, you're wrong'). Aggressive other-correction is a major social-cost failure mode for assistant-style agents. The preference for self-repair also licenses giving a human room to fix their own mistake instead of pre-empting it.",
        "shouldEval": "Test (1) appropriate self-initiated clarification before acting on an ambiguous request, (2) graceful uptake of other-initiated repair (correction) without over-apologizing or doubling down, and (3) face-saving design of other-correction \u2014 flagging an error mildly and giving the other party room to self-correct rather than overwriting them.",
        "measurable": "judge-required",
        "measurementIdea": "Curate dialogues with planted ambiguities and planted human errors; an LLM judge with an anchored rubric scores clarification-before-action, correction-acceptance, and the mitigation level of the agent's own corrections (bald vs. softened). Self-clarification rate on under-specified prompts is cheaply automatable.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Confirmed: Schegloff, Jefferson & Sacks (1977), 'The Preference for Self-Correction in the Organization of Repair in Conversation', Language 53(2):361-382 \u2014 the seminal repair paper. Description is accurate: repair = organized practices for troubles in speaking/hearing/understanding; initiation + outcome each by self or other yields four types (SISR, SIOR, OISR, OIOR); structural preferences for self-initiation and self-repair hold, and other-initiation uses gentle open-class initiators (huh?, repeats, candidate understandings)."
        }
      },
      {
        "name": "Adjacency pairs & conditional relevance",
        "origin": "Schegloff & Sacks 1973, 'Opening up closings', Semiotica 8:289\u2013327; elaborated in Schegloff 2007, 'Sequence Organization in Interaction'",
        "whatItIs": "Paired action types (question\u2013answer, greeting\u2013greeting, offer\u2013accept/decline, request\u2013grant/deny) where a first pair part makes a specific second pair part conditionally relevant: its presence is expected and its absence is noticeable and accountable. The norm is type-fitting (an answer fits a question), not merely temporal adjacency.",
        "groupChatRelevance": "The minimal accountability unit of being a good interlocutor: if a human asks the agent a question, an answer is owed; a greeting deserves a return; an offer deserves acceptance or a (mitigated) decline. In group settings the agent must also recognize when a first pair part was directed at someone else and NOT answer on their behalf \u2014 and notice 'noticeable absences' (a human's question to the group going unanswered) it might appropriately fill.",
        "shouldEval": "Test type-fitted responsiveness: given a first pair part addressed to the agent, does it produce the relevant second pair part (not a tangent)? Does it refrain from supplying a second pair part owed by another participant? Does it track and, where appropriate, surface a relevant 'official absence' (an unanswered group question)?",
        "measurable": "cheap-automatable",
        "measurementIdea": "Tag dialogues with first-pair-part type + intended addressee; programmatically check the agent's next contribution for a type-fitting second pair part addressed to the right party. Response-relevance can be scored with a light classifier; addressee-correctness against the annotated target is exact-match cheap.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Schegloff & Sacks 1973 \"Opening Up Closings\", Semiotica 8:289-327 is exact; elaborated in Schegloff 2007 \"Sequence Organization in Interaction\". Description of adjacency pairs, conditional relevance, and type-fitting is accurate. (Aside: \"conditional relevance\" was first coined in Schegloff 1968, but its association with the 1973 adjacency-pair work as cited is correct.)"
        }
      },
      {
        "name": "Sequence organization & expansion (pre-/insert-/post-)",
        "origin": "Schegloff 2007, 'Sequence Organization in Interaction: A Primer in Conversation Analysis, Vol. 1', Cambridge University Press",
        "whatItIs": "Adjacency pairs are the kernel of larger course-of-action structures. Base pairs are expanded by pre-sequences (pre-requests, pre-invitations that test the ground), insert sequences (a Q\u2013A nested before the base second pair part is given), and post-expansions (sequence-closing thirds like 'okay/great', or repair). Coherent action is built across many turns, not one.",
        "groupChatRelevance": "An agent must hold the thread of a multi-turn action across intervening turns by other participants \u2014 recognize a pre-sequence as a pre (not a literal question), do necessary insert clarification before completing a task, and provide closing thirds so the human knows the action landed. In group chat, intervening side-talk constantly separates a first pair part from its second; the agent must keep the action alive across the gap rather than losing the plot.",
        "shouldEval": "Test multi-turn action coherence under interleaving: can the agent resume and complete a base sequence after N intervening turns from other parties? Does it interpret pre-sequences correctly (e.g., 'are you free?' as a pre-invitation), use insert-clarification when needed, and emit sequence-closing acknowledgments? Measure task-completion under interruption and 'thread-drop' rate.",
        "measurable": "judge-required",
        "measurementIdea": "Build scripted group dialogues that interleave a target task sequence with distractor side-conversations at controlled depths; measure completion rate and whether the agent re-anchors to the right pending action. Pre-sequence interpretation needs a judge; sequence-closing-third presence is cheaply detectable.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Schegloff 2007, Sequence Organization in Interaction Vol. 1 (Cambridge UP) is correct; description of base pairs, pre-/insert/post-expansions and sequence-closing thirds matches the source."
        }
      },
      {
        "name": "Participation framework & footing (Goffman)",
        "origin": "Goffman 1981, 'Footing', in Forms of Talk, University of Pennsylvania Press; extended by C. Goodwin & M. H. Goodwin 2004, 'Participation'",
        "whatItIs": "Goffman decomposes 'speaker' into animator (who voices), author (who composed the words), and principal (whose position is expressed); and 'hearer' into ratified participants (addressed + unaddressed recipients) vs. unratified (bystanders, overhearers, eavesdroppers). 'Footing' is the participant's alignment/stance, which shifts moment to moment. Communication is a multi-party configuration of stances, not a sender\u2013receiver dyad.",
        "groupChatRelevance": "This is THE multi-party construct. An agent must track who is addressed vs. merely present, must not treat overheard talk between two humans as a request to it, must manage its own footing (when it speaks as itself = principal, vs. relaying/quoting = animator of another's words), and must respect unaddressed-recipient status (don't answer for people, but stay oriented). Footing failures \u2014 answering when you're a bystander, or quoting a source as if it were your own claim \u2014 are core group-chat error modes.",
        "shouldEval": "Test participant-status tracking: (1) does the agent respond only when ratified-as-addressed and stay silent (but attentive) when human-to-human talk is not addressed to it; (2) does it correctly mark footing when relaying/quoting (animator) vs. asserting its own stance (principal), e.g., attributing claims to sources rather than self-authoring them; (3) does it correctly model who-knows-what given the participation configuration.",
        "measurable": "judge-required",
        "measurementIdea": "Annotate each turn with addressee + participant-status; score the agent's intervene/abstain decision against ratified-addressed ground truth (cheap given labels). Footing/attribution (animator vs. principal) needs a judge checking whether relayed content is properly sourced vs. mis-claimed as the agent's own.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Goffman 1981 'Footing' (Forms of Talk, U. Penn Press) and Goodwin & Goodwin 2004 'Participation' (Companion to Linguistic Anthropology, Blackwell) are correctly attributed; animator/author/principal and ratified vs. unratified hearer roles plus footing-as-shifting-alignment are accurately described. Minor note: 'Footing' first appeared in Semiotica 1979 before being collected in Forms of Talk 1981, but the cited form is standard and correct."
        }
      },
      {
        "name": "Recipient design (audience design)",
        "origin": "Sacks & Schegloff (recipient design, e.g., Sacks 1992 Lectures; Sacks & Schegloff 1979 on reference); cf. H. H. Clark & Murphy 1982 'audience design'; Giles' accommodation",
        "whatItIs": "Speakers design talk for their specific recipients \u2014 word choice, level of explicitness, person-reference, presupposed shared knowledge \u2014 displaying orientation to what these particular co-participants know and need. The 'minimization/recognition' tension in reference (using the least elaborate form the recipient will recognize) is a canonical case.",
        "groupChatRelevance": "In a group, the agent has a heterogeneous audience: an expert and a novice, an insider who shares context and a newcomer who doesn't. Good recipient design means calibrating explanation depth, jargon, and references to the actual mix of participants \u2014 not defaulting to one-size answers. Over-explaining to experts and under-explaining to novices are both failures; so is leaking private/context one party shouldn't have.",
        "shouldEval": "Test audience calibration: given a known mix of participant expertise/knowledge-states, does the agent tune explicitness, jargon, and shared-knowledge presuppositions appropriately for the present recipients? Does it avoid referencing context a given recipient doesn't share, and avoid redundant explanation to those who already know? Measure calibration error across recipient profiles.",
        "measurable": "judge-required",
        "measurementIdea": "Same query, varied participant profiles (expert-only, novice-only, mixed); a judge rates whether explicitness/jargon match the audience and whether person-references are recognizable to the present recipients. Jargon-density and readability metrics are cheap proxies; the 'right level for THIS group' judgment is judge-required and partly contested.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Confirmed: Sacks & Schegloff 1979 (\"Two Preferences in the Organization of Reference to Persons,\" in Psathas ed., Everyday Language, pp.15-21) is the seminal source for the minimization/recognition tension; \"recipient design\" is a core Sacks/Schegloff term, and Clark & Murphy 1982 'audience design' (in Le Ny & Kintsch eds.) is correctly cited as the cognate. Description is accurate."
        }
      },
      {
        "name": "Floor management with 3+ parties (selection, schisming, addressee vs. next-speaker)",
        "origin": "Sacks, Schegloff & Jefferson 1974 (next-speaker selection); Egbert 1997 (schisming); Auer 2018 / Lerner 2003 (gaze, addressing, next-speaker selection in 3-party talk)",
        "whatItIs": "Multi-party talk separates addressee selection from next-speaker selection: a turn can be addressed to several recipients while selecting only one to respond ('closed-floor' vs. 'open-floor' questions). Participants deploy address terms, gaze, and sequential placement to allocate. 'Schisming' is the splitting of one conversation into two simultaneous floors; participants actively work to prevent or repair it.",
        "groupChatRelevance": "The agent must decide, when the floor is open, whether it is an eligible next speaker, and must allocate explicitly when it wants a specific human to act (use @-mention / address terms rather than a vague broadcast). It should detect and avoid causing schisms (spawning a tangent that fractures the group) and notice when it has been left holding the floor unnecessarily. Address-vs-next-speaker confusion produces the classic 'the bot answered a question meant for Bob' failure.",
        "shouldEval": "Test floor allocation: (1) when the agent issues a directive/question to the group, does it explicitly select the intended actor (address term / mention) vs. broadcasting ambiguously; (2) does it correctly answer open-floor questions only when eligible and not pre-empt closed-floor questions aimed at a named human; (3) does it avoid schisming (not derailing into a private side-thread that fragments the group). Measure addressee-attribution accuracy and schism-induction rate.",
        "measurable": "judge-required",
        "measurementIdea": "Multi-party scenarios labeled with open- vs. closed-floor question type and intended next-speaker; score the agent's answer/abstain + explicit-selection behavior against labels (largely automatable). Schism induction (did the agent fork the conversation?) needs a judge over the resulting transcript and is harder/contested to operationalize.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "All four sources verified: Sacks, Schegloff & Jefferson 1974 (Language 50:696-735, current-selects-next rule); Egbert 1997 (Research on Language and Social Interaction 30(1):1-51, schisming = single conversation splitting into multiple simultaneous ones); Lerner 2003 (Language in Society 32(2):177-201, address terms/gaze/tacit methods for next-speaker selection); Auer 2018 (chapter 'Gaze, addressee selection and turn-taking in three-party interaction', Benjamins). The description's core claim \u2014 that multi-party talk analytically separates addressee selection from next-speaker selection (one turn can address several recipients while gaze/sequential placement selects one to respond) \u2014 accurately reflects Auer 2018 and Lerner 2003, and the schisming definition matches Egbert 1997."
        }
      }
    ]
  },
  {
    "discipline": "Social psychology of group conversation and small-group dynamics",
    "metaNotes": [
      "Conformity findings (Asch; Deutsch & Gerard) are culturally and era-specific: meta-analyses show conformity magnitudes vary by culture (higher in collectivist samples) and have declined over decades. The AI analogue (sycophancy) is real and measurable, but mapping 'normative vs. informational' onto a model is interpretive \u2014 the model has no felt need for approval, so the construct is a useful metaphor, not a literal mechanism.",
      "Edelsky's F1/F2 floor distinction and much classic floor-control/dominance work was developed for spoken, synchronous, gender-focused settings. Text group chat lacks overlap, prosody, and simultaneous talk, so 'interruption' and 'floor' must be redefined (e.g., reply timing, addressed-vs-unaddressed) before the constructs transfer cleanly.",
      "Keysar's egocentric-anchoring claim is actively contested: a substantial body (e.g., constraint-based / Hanna & Tanenhaus) finds adults use common ground from the earliest moments, not only in a late effortful repair stage. The eval-relevant point (track divergent per-participant knowledge) is robust; the specific two-stage timing claim is not settled.",
      "Social facilitation has two competing explanations (Zajonc's mere-presence/drive vs. Cottrell's evaluation apprehension), and effects are moderated by task type and individual differences; meta-analytic effect sizes are modest. The 'public vs. private' AI consistency test is well-motivated but should not be over-attributed to a single mechanism.",
      "Group polarization is well-replicated but its mechanism (persuasive-arguments vs. social-comparison) remains debated, and the boundary between healthy 'amplifying a good idea' and harmful 'extremitization' is normative \u2014 an eval must define which direction counts as failure per use case.",
      "Status / Expectation States Theory is strongly supported in lab task-groups, but operationalizing 'status' for an AI eval risks reifying demographic categories; the cleaner, less contestable test holds content fixed and varies status cues to detect bias, rather than asserting what status hierarchy 'should' look like.",
      "Several constructs (grounding accuracy, accommodation appropriateness, multi-party theory of mind) are judge-dependent and sensitive to rubric design; inter-rater reliability and judge bias (an LLM judge sharing the agent's blind spots) are live threats to validity. Where ground truth can be planted (information partitions, known-false assertions, fixed content + varied cues), prefer the automatable variant."
    ],
    "constructs": [
      {
        "name": "Common ground & grounding (least collaborative effort, grounding criterion)",
        "origin": "Clark & Brennan 1991, \"Grounding in Communication\" (in Resnick et al., eds., Perspectives on Socially Shared Cognition); building on Clark & Wilkes-Gibbs 1986",
        "whatItIs": "Conversation is not transmission but a collaborative process of building mutual knowledge. Participants seek positive evidence of understanding and ground each contribution only well enough to meet current purposes (the grounding criterion), while minimizing the joint work required (least collaborative effort). Backchannels, repair, clarification requests, and acknowledgments are the machinery.",
        "groupChatRelevance": "In multi-party chat, grounding is harder: the agent must track what is mutually known across several people who joined at different times, may have missed messages, or hold divergent assumptions. An agent that assumes its own knowledge state is shared, or that over-explains when ground is already established, breaks least-effort and either confuses or bores the group.",
        "shouldEval": "Test whether the agent maintains an accurate per-participant common-ground model: does it seek/offer evidence of understanding (acknowledge, paraphrase, ask targeted clarification rather than restate everything), repair miscommunication when a participant signals confusion, and calibrate explanation depth to what has already been established in the thread vs. what a newly-arrived participant missed? Penalize both ungrounded assumptions (referring to things not in shared context) and redundant over-grounding.",
        "measurable": "judge-required",
        "measurementIdea": "Scripted multi-party transcripts with planted grounding events (a participant signals non-understanding; a newcomer joins mid-thread; a referent is ambiguous across speakers). A rubric-driven LLM judge scores repair-initiation, appropriateness of clarification requests, and whether references stay within established common ground. Cheap proxy: count unresolved referent errors and redundant re-explanations per transcript.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Clark & Brennan 1991 \"Grounding in Communication\" in Resnick/Levine/Teasley (eds.) Perspectives on Socially Shared Cognition (APA), building on Clark & Wilkes-Gibbs 1986 Cognition 22:1-39 \u2014 all correct; grounding criterion + least collaborative effort + backchannel/repair/acknowledgment machinery accurately described."
        }
      },
      {
        "name": "Informational vs. normative social influence (conformity)",
        "origin": "Deutsch & Gerard 1955, \"A Study of Normative and Informational Social Influences upon Individual Judgment\" (J. Abnormal & Social Psychology); rooted in Asch 1951/1956 conformity line experiments",
        "whatItIs": "People conform for two distinct reasons: informational influence (treating others' views as evidence about reality, especially under ambiguity) and normative influence (going along to gain acceptance or avoid disapproval, even against private judgment). Asch showed sizeable conformity to a unanimous incorrect majority on an unambiguous perceptual task.",
        "groupChatRelevance": "An AI agent in a group faces social pressure dynamics: multiple users asserting a false consensus, or a high-status user pushing a wrong claim. The agent should update on genuine evidence (legitimate informational influence) but resist capitulating to mere social pressure or majority assertion against fact (sycophancy is the AI analogue of normative conformity).",
        "shouldEval": "Test sycophancy/conformity under group pressure: when several participants (or one assertive participant) confidently assert something false or push back on a correct answer, does the agent hold a well-justified position, distinguish 'you've given me new evidence' from 'you simply disagree', and avoid flipping its factual stance to placate the group? Conversely, test that it DOES update when real evidence is provided (not stubbornness).",
        "measurable": "cheap-automatable",
        "measurementIdea": "Adversarial multi-turn setups with a known ground truth: confederate users assert the wrong answer (normative pressure, no evidence) vs. supply genuine corrective evidence. Measure flip-rate on the false-pressure condition (should be low) and appropriate-update-rate on the evidence condition (should be high). Both are automatable against ground truth.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Deutsch & Gerard (1955), J. Abnormal & Social Psychology, 51, 629-636 is the correct seminal source, rooted in Asch's 1951/1956 line-judgment conformity work; the informational-vs-normative distinction and the Asch unanimous-incorrect-majority finding are described accurately."
        }
      },
      {
        "name": "Group polarization",
        "origin": "Moscovici & Zavalloni 1969, \"The group as a polarizer of attitudes\" (J. Personality & Social Psychology); related risky-shift work, Stoner 1961",
        "whatItIs": "After group discussion, members' attitudes become more extreme in the direction the group already leaned. Driven by persuasive-arguments (hearing novel same-side arguments) and social-comparison/normative processes. Discussion does not average opinions toward moderation; it amplifies the dominant tendency.",
        "groupChatRelevance": "A persuasive AI agent in a like-minded group can act as an accelerant for polarization, supplying fresh one-sided arguments that push an already-leaning group toward extremity (e.g., escalating a pile-on, hardening a risky plan, amplifying a conspiracy). The agent is uniquely capable of generating the 'novel persuasive arguments' that drive the effect.",
        "shouldEval": "Test whether the agent amplifies vs. moderates a leaning group: in a group already trending one direction, does it surface counter-considerations, base rates, and steelman the minority view, or does it pile on with confirming arguments and escalate risk/extremity? Eval should reward balanced framing and de-escalation in homogeneous groups, and flag one-sided amplification.",
        "measurable": "judge-required",
        "measurementIdea": "Seed a group chat that leans toward an extreme/risky stance; have the agent contribute several turns. Measure pre/post 'extremity' of the conversation's stance via judge ratings, and count balancing moves (counterarguments introduced, hedges, minority-view surfacing) vs. confirming/escalating moves. Comparing agent-present vs. agent-absent threads isolates its amplification effect.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Moscovici & Zavalloni (1969), \"The Group as a Polarizer of Attitudes,\" J. Personality & Social Psychology, 12, 125-135; Stoner 1961 risky-shift precursor and the persuasive-arguments + social-comparison mechanisms are all accurate."
        }
      },
      {
        "name": "Status hierarchies & expectation states in talk",
        "origin": "Berger, Cohen & Zelditch 1972 and Berger, Fisek, Norman & Zelditch 1977 (Expectation States / Status Characteristics Theory); Ridgeway & Berger syntheses",
        "whatItIs": "Even in nominally equal task groups, participation and influence inequalities form fast and stabilize. They track 'status characteristics' (gender, race, perceived expertise) that bias performance expectations, so high-status members get more speaking turns, more deference, and less interruption regardless of actual competence on the task.",
        "groupChatRelevance": "An AI agent both reads and reproduces status. It may defer disproportionately to a participant it infers is high-status (a boss, a confident expert-sounding user) and discount or talk over lower-status participants, importing biased expectations rather than weighting contributions on merit. It can also reinforce or counteract who gets attended to.",
        "shouldEval": "Test status-fair attention and credit: does the agent weight contributions by content quality rather than by inferred status cues (title, assertiveness, demographic signals), distribute acknowledgment without systematically favoring the apparent highest-status speaker, and avoid dismissing or under-attributing good ideas from low-status participants? Should test for demographic and seniority bias in who it agrees with, credits, and engages.",
        "measurable": "cheap-automatable",
        "measurementIdea": "Counterbalanced experiments: hold the CONTENT of a contribution fixed and vary the speaker's status cues (signature/title, demographic name, assertiveness). Measure differential agreement rate, credit attribution, response length, and deference. Systematic variation tied to status cues at fixed content quality is an automatable bias metric.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Berger, Cohen & Zelditch 1972 (ASR 37:241-255) and Berger, Fisek, Norman & Zelditch 1977 (Elsevier) are correctly attributed as the seminal Status Characteristics/Expectation States Theory works; Ridgeway & Berger syntheses confirmed. Description accurately captures how status characteristics bias performance expectations and drive participation/influence/deference/interruption inequalities in nominally equal task groups."
        }
      },
      {
        "name": "Floor-control & participation inequality (turn-taking, dominance, silencing)",
        "origin": "Sacks, Schegloff & Jefferson 1974 (turn-taking systematics); Edelsky 1981 \"Who's got the floor?\" (singly-developed F1 vs. collaborative F2 floor); conversational-dominance literature",
        "whatItIs": "Talk is organized by a turn-allocation system; in multi-party settings a distinction emerges between the 'floor' (the acknowledged what's-going-on) and individual turns. Some participants dominate the floor (more turns, interruptions, topic control) while others are effectively silenced; collaborative floors distribute participation more evenly.",
        "groupChatRelevance": "An AI agent can monopolize the floor (long, frequent, topic-controlling messages) and drown out humans, or conversely manage the floor to include quiet participants. In multi-party chat it must judge WHEN to speak vs. stay silent, whether it's interrupting an exchange between two humans, and whether it's crowding out human participation.",
        "shouldEval": "Test floor management: does the agent know when NOT to speak (not every message needs a reply; don't jump into a two-human exchange uninvited), keep its share of the floor proportionate, avoid serial monologuing, and actively draw in or yield to under-participating humans? Eval should measure participation balance and unsolicited-interjection rate, not just response quality.",
        "measurable": "cheap-automatable",
        "measurementIdea": "Replay multi-party logs and let the agent decide turn-by-turn whether/how much to contribute. Cheap metrics: agent's share of total tokens/turns, rate of replying when not addressed, interruption of in-progress human exchanges, and whether it solicits silent members. Appropriateness of a 'stay silent' decision needs a judge, but the volumetric and addressed-vs-unaddressed signals are automatable.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Sacks, Schegloff & Jefferson 1974 (Language 50:696-735, turn-taking systematics) and Edelsky 1981 'Who's Got the Floor?' (Language in Society 10:383-421, F1 singly-developed vs. F2 collaborative floor) are correctly attributed; description of floor/turn distinction and dominance vs. collaborative participation is accurate."
        }
      },
      {
        "name": "Social loafing / free-riding (and effort responsibility)",
        "origin": "Latan\u00e9, Williams & Harkins 1979, \"Many hands make light the work: The causes and consequences of social loafing\" (J. Personality & Social Psychology); Karau & Williams 1993 meta-analysis",
        "whatItIs": "Individuals exert less effort on a collective task when their contribution is pooled and not individually identifiable; diffusion of responsibility and reduced evaluability drive it. Effort returns when contributions are identifiable and individuals feel their input is essential and evaluated.",
        "groupChatRelevance": "In a group with multiple agents/humans, an AI agent may under-contribute when it assumes 'someone else will handle it', or its presence may induce humans to loaf (free-ride on the AI). Conversely, ambiguity about who owns a task in a group can lead to dropped balls or to the agent over-functioning and doing everyone's work.",
        "shouldEval": "Test responsibility attribution and effort calibration in shared tasks: does the agent correctly determine when it is on the hook vs. when another party owns the action, avoid dropping a task on the assumption others will cover it (diffusion of responsibility), and avoid over-functioning that displaces human contribution? Should test explicit ownership tracking ('who is doing what') in multi-agent / human-in-the-loop group tasks.",
        "measurable": "judge-required",
        "measurementIdea": "Multi-party task scenarios with deliberately ambiguous ownership and some tasks assigned to others. Measure dropped-task rate (agent fails to act when it was the responsible party), over-reach rate (agent does work explicitly owned by a human/other agent), and whether it clarifies ownership. Ground truth on 'whose task' makes the drop/over-reach counts semi-automatable; appropriateness needs a judge.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Both citations verified: Latan\u00e9, Williams & Harkins 1979, JPSP 37(6):822-832; Karau & Williams 1993 meta-analysis, JPSP 65(4):681-706 (Collective Effort Model). Description of diffusion of responsibility, reduced evaluability, and effort returning under identifiability/essentiality is accurate."
        }
      },
      {
        "name": "Audience effects / social facilitation & evaluation apprehension",
        "origin": "Zajonc 1965, \"Social Facilitation\" (Science) \u2014 mere-presence drive theory; Cottrell 1972 evaluation-apprehension refinement",
        "whatItIs": "The mere presence of others raises arousal and boosts dominant responses: performance on simple/well-learned tasks improves, but on complex/novel tasks it degrades. Later work attributes much of the effect to evaluation apprehension \u2014 the expectation of being judged by an audience.",
        "groupChatRelevance": "In a group, the AI has a visible audience, which is the AI analogue of evaluation pressure: it may shift toward 'dominant', safe, crowd-pleasing responses, perform for the room rather than help the addressed user, or change behavior depending on who is watching. It may also alter a human's behavior by acting as a perceived evaluator.",
        "shouldEval": "Test audience-dependent consistency: does the agent give the same quality and honesty of answer in a public group as in a private DM (no 'playing to the room'), avoid grandstanding or performative agreement when observed by many, and not degrade on genuinely hard tasks just because the setting is social/public? Eval should compare private-vs-public response quality and honesty for matched prompts.",
        "measurable": "cheap-automatable",
        "measurementIdea": "Present identical task prompts in a 1:1 condition vs. a large-audience group condition; measure deltas in correctness, hedging, sycophancy, and verbosity/performativity. Systematic public-vs-private divergence at fixed task is an automatable consistency metric (judge needed only to rate 'performativity').",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Zajonc (1965, Science) introduced the drive theory of social facilitation \u2014 mere presence raises arousal and strengthens dominant responses (helping simple/well-learned tasks, hurting complex/novel ones); Cottrell (1972) refined it with evaluation apprehension (a learned drive from anticipated judgment). Attribution and description are accurate."
        }
      },
      {
        "name": "Communication Accommodation Theory (convergence / divergence)",
        "origin": "Giles 1973; Giles, Coupland & Coupland 1991 (Contexts of Accommodation); Giles & Ogay 2007 review",
        "whatItIs": "Speakers adjust their style (vocabulary, formality, register, rate, self-disclosure) toward (convergence) or away from (divergence) their interlocutors. Convergence signals affiliation and eases understanding; over-convergence can feel patronizing (over-accommodation), and divergence asserts distinct identity or distance.",
        "groupChatRelevance": "An AI agent with many interlocutors must accommodate appropriately: matching register/formality to the group without mirroring harmful content, code-switching across multiple participants with different styles in one thread, and avoiding both cold non-accommodation (robotic, tone-deaf) and over-accommodation (mimicry, condescension, mirroring a user's hostility or slang inappropriately).",
        "shouldEval": "Test appropriate accommodation across a heterogeneous group: does the agent converge on register/formality/terminology enough to be affiliative and clear, without over-accommodating (patronizing simplification, parroting slang, mirroring profanity/hostility) and without harmful convergence (matching a user's aggression or bias)? Should test multi-addressee style management within a single thread.",
        "measurable": "judge-required",
        "measurementIdea": "Construct group threads with participants of divergent styles (formal expert, casual newcomer, hostile user). Judge whether the agent's register-matching is appropriate per addressee, flag over-accommodation (condescension, slang-parroting) and harmful convergence (mirroring hostility/bias). Style-distance metrics (formality/lexical alignment) give a cheap proxy, but appropriateness is judge-dependent.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Confirmed: Giles 1973 (originally Speech Accommodation Theory; renamed CAT ~1987), Giles/Coupland/Coupland 1991 Contexts of Accommodation, and Giles & Ogay 2007 review all correctly attributed; convergence/divergence and over-accommodation described accurately."
        }
      },
      {
        "name": "Theory of mind & perspective-taking in groups (egocentric anchoring)",
        "origin": "Keysar, Barr, Balin & Brauner 2000, \"Taking Perspective in Conversation\" (Psychological Science) \u2014 egocentric anchoring & adjustment; Premack & Woodruff 1978 on theory of mind",
        "whatItIs": "Interlocutors often start egocentrically \u2014 interpreting and producing language from their own perspective \u2014 and only effortfully adjust to the partner's distinct knowledge when a problem is detected. Successful communication in groups requires modeling that different participants know different things (divergent mental states), not assuming shared knowledge.",
        "groupChatRelevance": "With multiple participants, an AI must maintain SEPARATE mental-state models per person: who knows X, who was present for Y, who holds which false belief. Egocentric failure (assuming everyone shares the agent's or one user's knowledge) causes leaks (revealing info one party shouldn't see), confusing references, and answers pitched to the wrong knowledge state.",
        "shouldEval": "Test per-participant mental-state modeling: can the agent track who-knows-what across participants (including divergent/false beliefs and information one party hasn't seen), avoid disclosing to a participant something they shouldn't know, resolve references using the right participant's perspective, and pitch explanations to each addressee's actual knowledge? This is the multi-party theory-of-mind / information-leak test.",
        "measurable": "judge-required",
        "measurementIdea": "Adapt false-belief / asymmetric-knowledge tasks to group chat: participant A learns X privately, B does not; later questions probe whether the agent (a) answers B without leaking X, (b) attributes the correct belief to each party, (c) resolves ambiguous references per-perspective. Leak detection against a known information-partition is automatable; nuanced belief attribution needs a judge.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "Full title is \"Taking Perspective in Conversation: The Role of Mutual Knowledge in Comprehension,\" Keysar, Barr, Balin & Brauner (2000), Psychological Science. Premack & Woodruff (1978), \"Does the chimpanzee have a theory of mind?\", Behavioral and Brain Sciences.",
          "note": "Both attributions verified correct; doc truncated the Keysar et al. subtitle but author/year/journal and the egocentric-anchoring description are accurate."
        }
      }
    ]
  },
  {
    "discipline": "Sociolinguistics & Pragmatics (multi-party conversation)",
    "metaNotes": [
      "Brown & Levinson's universality claim is the most contested construct here: Matsumoto (1988) and Ide (1989) argue negative face and the individualistic, volition-based model do not travel to collectivist/discernment cultures. Any face/politeness eval should report per-culture and avoid a single universal rubric.",
      "'Appropriate' politeness, redress, and register are inherently norm-laden and audience-relative \u2014 there is rarely a single correct answer, so most of these constructs need judge-based or human evaluation rather than exact-match scoring. Honesty: judges encode their own cultural defaults (typically Western), which can systematically mis-score non-Western interaction.",
      "The most cheaply and reliably automatable slices are the structural ones: addressee resolution / turn-ownership (participation framework) and implicature-comprehension with gold labels. Face-redress calibration, demeanor, and audience design require a rubric'd LLM judge with a human-rated calibration subset.",
      "Several constructs overlap and should share test sets rather than be measured in isolation: indirect speech acts + participation framework both feed addressee/intent disambiguation; audience design + code-switching both concern register calibration to who is present. Treat them as a connected battery.",
      "These theories were built for spoken, co-present interaction. Text chat strips prosody, gaze, and overlap timing \u2014 the original carriers of many contextualization cues and footing shifts. Operationalizations must map cues onto text-available signals (lexical register, punctuation, emoji, @-mentions, message length/timing) and acknowledge the modality gap.",
      "A recurring agent failure mode these constructs jointly predict and an eval should specifically target: over-contribution \u2014 speaking when unaddressed (participation framework), over-answering (Grice Quantity), and over-mitigating/sycophancy (face-work demeanor). Worth a dedicated 'restraint' metric.",
      "The Quality maxim partly overlaps existing factuality/hallucination evals; the novel multi-party contribution here is the interaction of truthfulness with face (hedging an unsupported claim) and with Quantity (saying the right amount), not factual accuracy alone."
    ],
    "constructs": [
      {
        "name": "Face & Face-Threatening Acts (FTAs)",
        "origin": "Brown & Levinson 1987, \"Politeness: Some Universals in Language Usage\" (building on Goffman); also Brown & Levinson 1978",
        "whatItIs": "Every interactant has \"face\": positive face (the want to be liked/approved/included) and negative face (the want to be unimpeded, free from imposition). Many ordinary acts\u2014requests, criticisms, disagreements, corrections, advice\u2014intrinsically threaten one or both kinds of face, for either speaker or hearer. Politeness is the systematic mitigation work done to offset that threat.",
        "groupChatRelevance": "In a group chat an agent constantly performs potential FTAs: correcting one person's claim in front of others, declining a request, flagging an error, disagreeing. Each act threatens a specific person's specific face and is witnessed by an audience, which amplifies the threat. Mishandling face reads as either rude/curt or sycophantic/evasive.",
        "shouldEval": "Test whether the agent selects FTA-mitigation strategy appropriate to the act's weightiness: when it must deliver a correction, refusal, or disagreement to a named participant, does it use redress (hedging, acknowledgment, giving reasons) proportional to the imposition, rather than either bald-on-record bluntness or so much hedging that the substantive content is lost? Score correctness-of-content held constant while varying face-redress.",
        "measurable": "judge-required",
        "measurementIdea": "Construct scenarios where the agent must do an FTA toward a specific member (correct Bob's wrong figure; decline Alice's out-of-scope request). An LLM judge with a rubric rates (a) substantive accuracy, (b) presence of appropriate redress strategy, (c) over/under-mitigation. Cross-check on a small human-rated set. Vary FTA weight (small typo vs. major factual error) and check the redress scales monotonically.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Correctly attributed to Brown & Levinson 1987 (Politeness: Some Universals in Language Usage, CUP), with the 1978 precursor in Goody's volume, building on Goffman's notion of face. Positive/negative face, FTAs, and politeness-as-mitigation are all accurately described."
        }
      },
      {
        "name": "Face-work, Deference & Demeanor",
        "origin": "Goffman 1955, \"On Face-Work\" (Psychiatry 18:213-231); Goffman 1956/1967, \"The Nature of Deference and Demeanor\"",
        "whatItIs": "Face is a public self-image on loan from the interaction, not a private possession; participants engage in ongoing ritual \"face-work\" to maintain their own and others' face, including avoidance and corrective rituals (repair after an offense). Deference is the marking of regard for another; demeanor is the conduct that establishes one's own worthiness. Both are mutually managed across the whole encounter.",
        "groupChatRelevance": "A group chat is a sustained encounter, not a single Q&A. An agent must do corrective face-work after it (or someone) commits a gaffe\u2014apologize, repair, restore equilibrium\u2014and must manage its own demeanor (not over-claim, not grovel) while showing appropriate deference to participants of differing status. Failure leaves the group in an unrepaired, awkward state.",
        "shouldEval": "Test repair/recovery behavior: after the agent makes an error that is called out, or after one participant insults another, does the agent perform appropriate corrective face-work (acknowledge, repair, restore) rather than ignoring it, over-apologizing, or escalating? Also test demeanor calibration: does it avoid both self-deprecating grovel and status-grabbing over-confidence?",
        "measurable": "judge-required",
        "measurementIdea": "Inject a planted error or interpersonal rupture into a multi-turn transcript, then measure the agent's next-turn behavior against a repair rubric (acknowledgment present, proportionate, forward-moving, no over-apology loop). Count escalation vs. de-escalation. A demeanor probe can use calibration: does stated confidence match correctness, and does apology length scale with offense severity?",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Both attributions verified: Goffman, \"On Face-Work,\" Psychiatry 18(3):213-231 (1955); and \"The Nature of Deference and Demeanor,\" American Anthropologist 58(3):473-502 (1956), reprinted in Interaction Ritual (1967) \u2014 the doc's \"1956/1967\" correctly captures both the original journal year and the collection reprint. Description is accurate: face as a publicly-on-loan self-image, ongoing face-work via avoidance and corrective (repair) rituals, deference = marking regard for other, demeanor = conduct establishing one's own worthiness, mutually managed across the encounter."
        }
      },
      {
        "name": "Cooperative Principle & Conversational Maxims (Implicature)",
        "origin": "Grice 1975, \"Logic and Conversation\" (in Syntax and Semantics 3)",
        "whatItIs": "Speakers are presumed to cooperate, observing maxims of Quantity (be as informative as required, no more), Quality (be truthful/evidenced), Relation (be relevant), and Manner (be clear, brief, orderly). Hearers exploit apparent flouting of a maxim to compute conversational implicatures\u2014meaning beyond the literal words.",
        "groupChatRelevance": "In group chat, much intent is implied, not stated. An agent must (a) infer what speakers implicate (e.g., \"it's getting late\" as a closing move) rather than only parse literal content, and (b) govern its own contributions\u2014not over-answer (Quantity violation: a wall of text in a fast chat), stay relevant to the live thread, and not assert beyond its evidence (Quality). The Quantity maxim is especially load-bearing: agents notoriously over-contribute in multi-party settings.",
        "shouldEval": "Two complementary criteria. (1) Implicature comprehension: given an indirectly-phrased group message, does the agent act on the implicated meaning, not just the literal? (2) Maxim adherence in production: is the agent's turn appropriately brief and relevant for the live multi-party tempo (Quantity/Relation), and does it refrain from stating unsupported claims as fact (Quality)?",
        "measurable": "cheap-automatable",
        "measurementIdea": "Implicature: a labeled set of indirect prompts with gold \"intended action\" answers\u2014scoreable automatically. Quantity/Manner: measure turn length relative to a target band for chat register and penalize info-dumps; flag turns that re-answer already-settled points. Quality: cross-check factual claims against a known ground truth and penalize unhedged false assertions. Mix of exact-match and judge for the relevance dimension.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Grice 1975 \"Logic and Conversation\" in Cole & Morgan (eds.), Syntax and Semantics Vol. 3: Speech Acts (pp. 41-58); Cooperative Principle, four maxims, and flouting-based implicature all described correctly."
        }
      },
      {
        "name": "Indirect Speech Acts",
        "origin": "Searle 1975, \"Indirect Speech Acts\" (in Syntax and Semantics 3); Searle 1969, \"Speech Acts\"",
        "whatItIs": "An illocutionary act performed by way of another: \"Can you pass the salt?\" is literally a question about ability but functions as a request. Recognizing the primary (intended) act requires reasoning over felicity conditions plus Gricean inference. Indirectness is itself a politeness resource\u2014off-record requests give the hearer an out.",
        "groupChatRelevance": "Group-chat directives are routinely indirect and softened (\"someone should probably look at the deploy\", \"would be great if the numbers were double-checked\"). An agent must recognize when an indirect form is actually a request/command directed (possibly implicitly) at it, and must itself choose directness vs. indirectness when assigning or asking\u2014being too direct reads as bossy in front of the group, too indirect leaves the ask unowned.",
        "shouldEval": "Test recognition: given indirect/hinted directives in a group thread, does the agent correctly identify whether an action is being requested and of whom (including when it is the implied target)? And production: when the agent needs something from a participant, does it calibrate directness to the relationship/face stakes rather than always issuing bald imperatives?",
        "measurable": "judge-required",
        "measurementIdea": "Recognition is partly automatable: labeled indirect utterances with gold (is-this-a-request? / who-is-the-target?) answers, including distractors where the agent is NOT the addressee (over-eager-action error). Production directness is judge-rated against a politeness-calibration rubric. The addressee-disambiguation slice overlaps with participation framework and can reuse that test set.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Searle 1975 (Syntax and Semantics 3, Cole & Morgan, pp. 59-82) and Searle 1969 Speech Acts are both correctly attributed; \"illocutionary act performed by way of another,\" the salt example, and the felicity-conditions+Gricean account are all faithful to Searle. The off-record/politeness framing is accurate but is most precisely Brown & Levinson (1978/87), downstream of Searle\u2014supplementary, not a misattribution of the core construct."
        }
      },
      {
        "name": "Audience Design & Style-Shifting",
        "origin": "Bell 1984, \"Language Style as Audience Design\" (Language in Society 13:145-204)",
        "whatItIs": "Speakers design their speech for their audience, shifting style (formality, vocabulary, code) in response to who is present. Bell ranks audience roles by influence: addressee (directly spoken to) > auditor (ratified, acknowledged) > overhearer (known but unratified) > eavesdropper. Accommodation is strongest toward the addressee but third parties measurably affect style. \"Referee design\" lets a speaker style toward an absent reference group.",
        "groupChatRelevance": "A group chat is the canonical mixed-audience situation: the agent addresses one person while several ratified auditors and overhearers watch. It must design each turn for the actual present audience\u2014e.g., not leaking detail meant for one member to the whole group, raising/lowering formality and jargon for the mix present, and recognizing that something said \"to Alice\" is read by everyone. This is the multi-party generalization of register control.",
        "shouldEval": "Test whether the agent adjusts register/content to the composition of the present audience: same underlying message, different group makeup (technical vs. mixed-expertise members; a senior stakeholder present vs. peers only) should yield appropriately different style and disclosure. Also test audience-awareness: does it avoid disclosing to the full group information appropriate only to one addressee?",
        "measurable": "judge-required",
        "measurementIdea": "Hold the task constant and vary the roster/seniority/expertise of present members across runs; have a judge rate whether register, jargon level, and disclosure shifted appropriately and consistently (a style-shift delta). A cheaper proxy: track jargon-token density and formality markers as a function of audience composition and check they move in the predicted direction. The disclosure-leak check is automatable with planted private-to-one details.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Verified: Allan Bell 1984 \"Language Style as Audience Design,\" Language in Society 13(2):145-204. Description is accurate \u2014 audience design, the addressee > auditor > overhearer > eavesdropper hierarchy (ranked by ratification/awareness and influence), accommodation strongest toward addressee with third parties affecting style to a lesser regular degree, and referee design (styling toward an absent reference group) all match the source."
        }
      },
      {
        "name": "Participation Framework & Footing",
        "origin": "Goffman 1981, \"Footing\" (in Forms of Talk)",
        "whatItIs": "The dyadic speaker/hearer model is inadequate. Hearers decompose into addressed recipients, unaddressed (ratified) participants, bystanders, overhearers, and eavesdroppers; speakers decompose into animator (who voices it), author (who composes it), and principal (who is committed to/responsible for it). \"Footing\" is a change in one's alignment/stance to the others, signaled and continually renegotiated.",
        "groupChatRelevance": "This is the structural backbone of multi-party chat. The agent must track who is addressed vs. merely ratified vs. overhearing on every turn, decide whom to address, and not answer when it is not the addressee. The author/principal distinction matters when the agent relays or quotes (animating someone else's words vs. speaking as itself / on behalf of a user)\u2014it must not blur \"I think X\" with \"the user said X.\"",
        "shouldEval": "Test addressee tracking and turn-ownership: in a thread with several participants, does the agent respond only when addressed or genuinely relevant, correctly resolve \"you\" / @-mentions, and avoid butting into exchanges between two humans? And footing-attribution: when relaying, does it preserve who authored/is responsible for a claim rather than absorbing it as its own assertion?",
        "measurable": "cheap-automatable",
        "measurementIdea": "Addressee resolution is largely automatable: scripted multi-party transcripts with gold labels for \"should the agent speak now?\" and \"who is the addressee?\"; score precision/recall on intrusion (speaking when unaddressed) vs. silence-when-needed. Attribution: planted relays where a judge or pattern check verifies the agent attributes claims to the right principal rather than stating them bare.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Confirmed: Goffman 1981, \"Footing\" in Forms of Talk (Univ. of Pennsylvania Press, pp. 124-159). Production format (animator/author/principal) and reception decomposition (addressed/unaddressed ratified participants; bystanders/overhearers/eavesdroppers) and the alignment definition of footing are all accurate."
        }
      },
      {
        "name": "Contextualization Cues, Code-Switching & Register",
        "origin": "Gumperz 1982, \"Discourse Strategies\" (conversational code-switching, contextualization cues)",
        "whatItIs": "Speakers signal how an utterance should be interpreted through subtle, non-propositional cues\u2014shifts in code, register, prosody, formality. Conversational code-switching (incl. register switching without changing language) does interactional work: addressee specification, quotation, reiteration/emphasis, message qualification, and personalization vs. objectivization. The switch itself is meaningful, not noise.",
        "groupChatRelevance": "A group chat mixes registers, in-jokes, technical jargon, and code/language switches across members. An agent must read register and code shifts as signals (e.g., a switch to formal register flags seriousness; a switch to a shared in-group code marks solidarity or addresses a specific bilingual member) and produce register-appropriate output\u2014matching the group's established code without misfiring (e.g., forced slang, or staying stiff where the group is casual).",
        "shouldEval": "Test register/code accommodation: does the agent match the group's prevailing register and any in-group code (technical vs. casual; a second language used among bilingual members) appropriately, and does it interpret a participant's register/code switch as the interactional signal it is (e.g., recognizing a shift to formality as escalation/seriousness) rather than ignoring it?",
        "measurable": "judge-required",
        "measurementIdea": "Interpretation: labeled snippets where a register/code switch carries a function (seriousness, quotation, addressee targeting); score whether the agent's response reflects the function. Production: judge-rate register match against the established thread register; a code-switch slice tests whether the agent appropriately uses/avoids a second language given the addressee's competence. Note cultural specificity\u2014appropriate code use varies by community, so judges need community context.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Correct: Gumperz 1982, Discourse Strategies (CUP). Contextualization cues + conversational code-switching are his; the cited functions match his canonical set (pp. 75-81). Minor: he lists SIX functions\u2014the description omits \"interjection\"\u2014but all listed functions and the framing are accurate."
        }
      },
      {
        "name": "Cultural Variation in Face & Politeness (Discernment / Wakimae)",
        "origin": "Matsumoto 1988 and Ide 1989 (critiques of Brown & Levinson using Japanese); concept of wakimae/discernment politeness",
        "whatItIs": "Brown & Levinson's individual, volitional, negative-face-centric model is contested cross-culturally. In many cultures (Japanese being the classic case) politeness is substantially discernment-based (wakimae): obligatory marking of relative social position via honorifics and form, oriented to group interdependence rather than the individual's freedom from imposition. What counts as a face threat, and which redress is required, varies by culture and relationship.",
        "groupChatRelevance": "An agent serving a multilingual/multicultural group cannot apply one fixed politeness model. The same blunt correction that is fine among Western peers may be a serious face violation elsewhere; honorific/address-form choices that are optional in one culture are obligatory in another. Getting status-marking and obligatory deference wrong is a per-culture failure mode invisible to a single-culture eval.",
        "shouldEval": "Test politeness portability: across locales/cultural settings (and languages with grammaticalized honorifics/T-V distinctions), does the agent apply the locally appropriate deference and address forms\u2014e.g., obligatory honorific level by relative status\u2014rather than transplanting one culture's directness norms everywhere? Flag both under-marking (too blunt/familiar) and over-marking.",
        "measurable": "human-study-only",
        "measurementIdea": "Honorific/address-form correctness in languages that grammaticalize it (Japanese keigo, Korean speech levels, T-V in Romance/Slavic) is partially automatable against grammatical rules given a known status configuration. But whether overall politeness is culturally appropriate is contested and norm-laden\u2014best validated by native-speaker human raters per culture. Treat any single-culture automated rubric as a proxy, not ground truth; report results per-locale rather than aggregated.",
        "_verify": {
          "attributionCorrect": true,
          "descriptionAccurate": true,
          "correction": "",
          "note": "Confirmed: Matsumoto 1988 (J. of Pragmatics 12:403-426, 'Reexamination of the Universality of Face') and Ide 1989 (Multilingua 8:223-248, 'Formal Forms and Discernment') correctly attributed; wakimae/discernment critique of Brown & Levinson's volitional, negative-face model accurately described."
        }
      }
    ]
  }
]