Kashi: Speaker Identity / Diarization Perspective
Technical research memo for developers
Date: 2026-04-21

Purpose
Turn the speaker-identity / diarization issue into concrete engineering doctrine, architecture rules, metric invalidation logic, and pilot acceptance criteria for Kashi.

Bottom line
Speaker identity is not a field. It is a probabilistic subsystem.

If Kashi keeps talking as if speaker attribution is a clean deterministic input, the product stays easy to attack. Your current stack already depends on transcript + speaker attribution + timestamps, and several of your shipped detectors are person-directed or dyad-directed. That means diarization is not a small upstream detail. It is part of the measurement model, the fairness model, the contestability model, and the legal survivability model.

The right doctrine is:
1. Separate “who spoke when” from “which real person that was.”
2. Separate “unknown speaker” from “wrong speaker.” Never collapse them.
3. Carry confidence and provenance through the full identity chain.
4. Suppress person-level and dyad-level outputs when speaker identity quality drops below threshold.
5. Make transcript/diarization dispute handling a first-class product workflow, not a support ticket.

======================================================================
1. Why this matters for Kashi specifically
======================================================================

Kashi’s current materials already make speaker attribution foundational. The progress deck says the system pulls transcript + speaker attribution + timestamps from Zoom / Teams / Meet, then computes detectors from turn timing and speaker attribution alone. It also defines person-level and dyad-level longitudinal views. That means identity resolution is already a hard dependency, whether or not the deck fully admits it.

At the same time, the legal/procedural work already identifies transcript and diarization error handling as a missing procedural requirement, not a side note. The measurement memo also says input quality must be part of confidence, and that poor transcript confidence, low diarization confidence, overlap-heavy segments, and multilingual confusion should down-rank or block output.

So the project’s weak point is not “we forgot diarization exists.” The weak point is this:
- the deck still sounds more deterministic than the real substrate is;
- the metric layer is more identity-fragile than the rhetoric admits;
- the dispute/correction layer is under-specified;
- the cross-meeting identity model is not yet explicit.

Critical correction:
Kashi may be deterministic conditional on its inputs, but its identity inputs are not inherently deterministic. That distinction has to be made explicit. Otherwise “deterministic” sounds like “certain,” which is false.

======================================================================
2. External technical reality: what current platforms and APIs actually show
======================================================================

Current platform behavior already proves that “speaker resolved” and “speaker truly identified” are not the same thing.

Microsoft Teams:
- Teams automatically identifies the speaker in captions/transcripts, but users can choose to hide their identity, which means even platform-native attribution is not always stable or fully exposed.
- Teams Rooms intelligent speaker attribution for in-room people depends on enrolled voice profiles and invitation linkage; without that, room audio is attributed to the room instead of named individuals.

Zoom:
- Zoom smart name tags for voice can feed captions, transcripts, and summaries, but they depend on room setup, enrollment, invitation linkage, and in some modes manual editing.
- Zoom’s own workflow explicitly allows editing smart voice name tags during the meeting, which means the attribution layer is not immutable ground truth.

Generic speech APIs:
- Azure diarization examples explicitly show intermediate “Speaker ID=Unknown” states before later labels like Guest-1 / Guest-2 appear.
- Google Cloud diarization assigns speaker numbers, not verified real identities, and recommends providing expected speaker count to improve output.
- Amazon Transcribe emits speaker_labels with generic values like spk_0, plus timestamps. Again: that is diarization, not durable person identity.

Hard implication for Kashi:
A vendor transcript speaker label is not a canonical person ID. It is evidence. Sometimes strong evidence. Sometimes weak evidence. Sometimes wrong evidence. Never treat it as a primary key by default.

Also, overlap remains a core unsolved difficulty. Recent diarization challenge work still treats noisy, multilingual, overlap-heavy conditions as hard and still reports meaningful diarization error even in strong systems. So any meeting-governance product that pretends overlap is a solved edge case is bluffing.

======================================================================
3. The technical doctrine Kashi should adopt
======================================================================

Kashi should model speaker identity as five separate layers:

Layer A: Speech activity / segmentation
Question: where is speech present, and where are turn boundaries likely to be?
Output: segments with start/end timestamps and overlap flags.
Failure mode: missed speech, false speech, broken boundaries.

Layer B: Within-meeting diarization
Question: which segments seem to belong to the same voice within this meeting?
Output: diarization clusters (speaker_1, speaker_2, etc.).
Failure mode: split one person into many clusters; merge many people into one cluster.

Layer C: Participant-instance mapping
Question: which diarization cluster maps to which meeting participant instance?
Output: meeting_participant_instance_id.
Failure mode: wrong roster match, guest/employee collision, room audio ambiguity, device-switch split.

Layer D: Durable person resolution
Question: is this the same real person across meetings?
Output: canonical_person_id with confidence.
Failure mode: cross-meeting drift, same display name collision, guest mistaken for employee, calendar mismatch.

Layer E: Metric eligibility
Question: is the identity chain good enough to allow this metric to exist?
Output: metric allowed / downgraded / suppressed.
Failure mode: pseudo-evidence shown as if valid.

This separation is non-negotiable. If Layers B–D are flattened into one implied “speaker” field, Kashi will silently confuse diarization, participant mapping, and durable identity.

======================================================================
4. The core engineering distinction: Unknown speaker vs Wrong speaker
======================================================================

Kashi needs explicit error semantics.

4.1 Unknown speaker

Definition:
A speech segment exists, but Kashi cannot responsibly attach it to a verified meeting participant.

Examples:
- room audio captured, but no enrolled room identity data;
- guest joined from a phone bridge without reliable identity linkage;
- transcript produced speaker_3 / Guest-1 / spk_2 but no safe roster match exists;
- device switch produced a new unlabeled participant instance;
- overlap or bad audio prevents stable assignment.

System meaning:
This is incomplete evidence, not false evidence. Unknown should increase suppression, not trigger silent forced assignment.

4.2 Wrong speaker

Definition:
Kashi currently attaches a segment to a person, but confidence/provenance or user dispute suggests that attachment is probably incorrect.

Examples:
- transcript says John, but meeting had two Johns and the wrong one was chosen;
- guest “Kenji” was merged into employee Kenji Mori;
- one person’s room speech was assigned to another enrolled voice;
- cluster merged two speakers but downstream logic treated them as one person;
- manual correction reveals the prior assignment was false.

System meaning:
This is contaminated evidence. Any person-level or dyad-level metric touching that assignment should be recomputed or invalidated.

4.3 Why the distinction matters

If Unknown and Wrong are collapsed into one generic “low confidence” bucket, Kashi loses the ability to:
- preserve epistemic honesty;
- differentiate suppressible uncertainty from active contamination;
- build a meaningful correction workflow;
- explain why one metric is hidden and another is recomputed.

Recommended status vocabulary
- RESOLVED_CONFIRMED
- RESOLVED_PROBABLE
- UNKNOWN_UNRESOLVED
- WRONG_SUSPECTED
- WRONG_CONFIRMED
- SPLIT_SUSPECTED
- MERGE_SUSPECTED
- OVERLAP_AMBIGUOUS

======================================================================
5. Identity resolution architecture Kashi should build
======================================================================

5.1 Canonical entities

Person
- canonical_person_id
- account_user_id (nullable)
- employee/guest flag
- org_id
- active state

Meeting
- meeting_id
- platform
- organizer_id
- calendar_event_id
- language_regime
- meeting_type

MeetingParticipantInstance
- participant_instance_id
- meeting_id
- platform_participant_id
- display_name
- email / directory_id if present
- join_source (desktop/mobile/room/phone/web)
- room_flag
- guest_flag
- join_time / leave_time

DiarizationCluster
- cluster_id
- meeting_id
- speaker_label_raw (e.g. Guest-1 / spk_0 / speaker_2 / “Room”)
- source_engine
- cluster_confidence
- overlap_ratio

UtteranceSegment
- utterance_id
- meeting_id
- cluster_id
- start_ms / end_ms
- transcript_text
- transcript_confidence
- overlap_flag
- boundary_confidence

IdentityEvidence
- evidence_id
- source_type (platform roster / platform transcript label / calendar invite / room voice profile / manual correction / participant self-claim / admin correction / prior meeting linkage)
- subject_id
- object_id
- confidence
- created_at
- version

IdentityMappingDecision
- decision_id
- meeting_id
- utterance_id or cluster_id
- participant_instance_id
- canonical_person_id (nullable)
- resolution_state
- confidence
- provenance_summary
- supersedes_decision_id

5.2 Matching order (highest trust to lowest)

Tier 1: Platform-authenticated account identity
Use when platform gives immutable user identity tied to the transcript/caption stream. This is the safest anchor.

Tier 2: Unique roster match
Transcript speaker name matches exactly one active participant instance after normalization and there is no collision.

Tier 3: Enrolled room speaker tech
Teams voice profile / Zoom smart voice tag / equivalent room identity signal, but only when enrollment, consent, and invite linkage conditions are satisfied.

Tier 4: Manual confirmed mapping
User or authorized reviewer explicitly corrects mapping.

Tier 5: Probable mapping
Used only for internal assistance, never for irreversible person-level claims. Example: one unlabeled cluster remains and one unmatched participant remains.

Tier 6: Unresolved pseudonym
UnknownSpeaker_01 etc. Preserve separately. Do not force a person match.

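As an illustration of the matching order in 5.2, here is a minimal Python sketch of how tiered evidence might be reduced to a mapping decision. The names Tier, Evidence, MappingDecision and resolve are hypothetical, and the rule that Tiers 1 to 4 yield RESOLVED_CONFIRMED while Tier 5 yields RESOLVED_PROBABLE is an assumption made for the example, not something the memo fixes.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum
from typing import Optional, Sequence


class Tier(IntEnum):
    """Evidence tiers from 5.2; a lower number means higher trust."""
    PLATFORM_ACCOUNT = 1
    UNIQUE_ROSTER_MATCH = 2
    ENROLLED_ROOM_VOICE = 3
    MANUAL_CONFIRMED = 4
    PROBABLE = 5


class ResolutionState(Enum):
    RESOLVED_CONFIRMED = "RESOLVED_CONFIRMED"
    RESOLVED_PROBABLE = "RESOLVED_PROBABLE"
    UNKNOWN_UNRESOLVED = "UNKNOWN_UNRESOLVED"


@dataclass(frozen=True)
class Evidence:
    tier: Tier
    participant_instance_id: str
    confidence: float


@dataclass(frozen=True)
class MappingDecision:
    cluster_id: str
    participant_instance_id: Optional[str]  # None = Tier 6, unresolved pseudonym
    resolution_state: ResolutionState
    confidence: float
    provenance_summary: str


def resolve(cluster_id: str, evidence: Sequence[Evidence]) -> MappingDecision:
    """Use the highest-trust evidence; stay unknown rather than guess."""
    if not evidence:
        return MappingDecision(cluster_id, None,
                               ResolutionState.UNKNOWN_UNRESOLVED, 0.0, "no evidence")

    best_tier = min(e.tier for e in evidence)
    top = [e for e in evidence if e.tier == best_tier]
    if len({e.participant_instance_id for e in top}) > 1:
        # Equally trusted evidence points at different participants.
        return MappingDecision(cluster_id, None,
                               ResolutionState.UNKNOWN_UNRESOLVED, 0.0,
                               f"conflict at {best_tier.name}")

    chosen = max(top, key=lambda e: e.confidence)
    # Assumption: only Tier 5 is "probable"; Tiers 1-4 count as confirmed.
    state = (ResolutionState.RESOLVED_PROBABLE if best_tier == Tier.PROBABLE
             else ResolutionState.RESOLVED_CONFIRMED)
    return MappingDecision(cluster_id, chosen.participant_instance_id,
                           state, chosen.confidence, best_tier.name)
```

The point of the sketch is the failure behaviour: missing or conflicting evidence produces UNKNOWN_UNRESOLVED with no participant attached, never a forced match. A real implementation would also persist each decision with a supersedes_decision_id so that corrections are versioned.
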
5.3 Critical prohibitions

Do NOT:
- use display_name alone as a primary key;
- merge cross-meeting identities purely from voice embeddings in v1;
- auto-resolve guest names into employee identities without immutable anchor;
- silently backfill unknowns into nearest person for the sake of prettier dashboards;
- treat room-level attribution as person-level attribution;
- treat diarization-cluster stability as durable human identity.

======================================================================
6. How to handle the concrete failure cases in the prompt
======================================================================

6.1 Stability across meetings

Question: how stable is speaker attribution across meetings?

Answer:
Raw diarization labels are not stable across meetings. Stable cross-meeting identity should be built from a separate canonical person layer, anchored primarily in platform user/account identity and roster linkage, not from raw speaker embeddings alone.

Recommendation:
- within meeting: diarization cluster -> participant instance
- across meetings: participant instance -> canonical person via account/calendar/directory linkage
- only compute longitudinal per-person and per-dyad metrics when canonical link confidence clears threshold

6.2 Name collisions

Examples:
- two Kenjis in same org;
- internal Kenji + external guest Kenji;
- romanized name collision across JP/CN/KR users.

Rules:
- never key on display name;
- require immutable participant or directory IDs where available;
- preserve guest namespace separately;
- UI should disambiguate by display_name + org/guest marker + participant instance, not by guessed merging.

6.3 Guest speakers

Rules:
- guest is a first-class entity type, not an exception;
- guest canonical IDs should remain separate from employee canonical IDs;
- if guest identity is weak, keep unresolved rather than coerced into roster.

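The collision and guest rules in 6.2 and 6.3 can be sketched in Python as follows. This is a sketch under stated assumptions: ParticipantInstance mirrors only a few fields from 5.1, and normalize_name, unique_roster_match and canonical_namespace_key are hypothetical helpers. NFKC plus casefold is one possible normalization choice; it does not address romanization variants.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ParticipantInstance:
    participant_instance_id: str
    display_name: str
    guest_flag: bool
    directory_id: Optional[str] = None  # immutable ID, when the platform exposes one


def normalize_name(name: str) -> str:
    """Fold width variants and case, and collapse whitespace."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


def unique_roster_match(label: str,
                        roster: Sequence[ParticipantInstance]
                        ) -> Optional[ParticipantInstance]:
    """Return a participant only if exactly one roster entry matches the label."""
    key = normalize_name(label)
    candidates = [p for p in roster if normalize_name(p.display_name) == key]
    # Zero or several candidates: leave unresolved instead of guessing.
    return candidates[0] if len(candidates) == 1 else None


def canonical_namespace_key(p: ParticipantInstance) -> Optional[str]:
    """Build a cross-meeting key from an immutable ID, never from display_name."""
    if p.directory_id is None:
        return None  # weak identity stays unresolved
    namespace = "guest" if p.guest_flag else "employee"
    return f"{namespace}:{p.directory_id}"
```

With an internal “Kenji” and an external guest “Kenji” on the same roster, unique_roster_match returns None, so the cluster stays an unresolved pseudonym; the guest without a directory ID also gets no canonical key, which keeps it out of the employee namespace.
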
6.4 Missing labels

Rules:
- unresolved clusters remain visible as UnknownSpeaker_n;
- all dyad metrics involving that speaker should be suppressed or downgraded;
- meeting-level telemetry may still be shown if segmentation quality is adequate.

6.5 Device switching

Examples:
- same employee joins from laptop, then phone;
- participant moves from individual client to Teams Room / Zoom Room;
- reconnect creates second participant instance.

Rules:
- maintain participant-instance layer separate from person layer;
- merge instances only when platform account identity or explicit correction supports it;
- do not infer same person solely because names are similar.

6.6 Room audio / in-room group speech

This is one of the nastiest cases. Room audio often gives partial or grouped attribution rather than clean speaker ownership.

Rules:
- room speech without enrolled/verified in-room identity should not generate person-level metrics;
- allow observational room-level metrics if useful, but block individual blame attribution;
- if room identity tech is present, persist its provenance because it has different reliability characteristics from remote-client attribution.

======================================================================
7. What downstream metrics break when diarization quality drops
======================================================================

This is the operational core. The product should know exactly which metrics become invalid when identity quality degrades.

7.1 Intrusive interruption

Needs:
- accurate speech boundaries;
- accurate overlap detection;
- correct speaker identity for interrupter and interrupted.

Invalid when:
- overlap ambiguity high;
- initiator or target unresolved/wrong;
- merged speakers;
- room audio attribution only.

Fallback:
- show meeting-level overlap count only;
- suppress person->person interruption matrix.

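The invalidation and fallback rules for intrusive interruption in 7.1 could be expressed roughly as below. This is a sketch only: OverlapEvent, is_attributable and summarize_interruptions are hypothetical names, the boolean flags stand in for whatever quality signals the pipeline actually produces, and treating only RESOLVED_CONFIRMED as attributable is an assumed policy.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# Assumption: only confirmed identities may feed person-to-person output.
ATTRIBUTABLE_STATES = frozenset({"RESOLVED_CONFIRMED"})


@dataclass(frozen=True)
class OverlapEvent:
    initiator_id: str
    target_id: str
    initiator_state: str   # resolution state of the interrupter
    target_state: str      # resolution state of the interrupted speaker
    overlap_ambiguous: bool
    merge_suspected: bool
    room_audio_only: bool


def is_attributable(event: OverlapEvent) -> bool:
    """True only when none of the 'Invalid when' conditions applies."""
    return (
        event.initiator_state in ATTRIBUTABLE_STATES
        and event.target_state in ATTRIBUTABLE_STATES
        and not event.overlap_ambiguous
        and not event.merge_suspected
        and not event.room_audio_only
    )


def summarize_interruptions(events: Iterable[OverlapEvent]) -> Dict[str, object]:
    """Always report the meeting-level count; gate the person matrix."""
    events = list(events)
    matrix = Counter(
        (e.initiator_id, e.target_id) for e in events if is_attributable(e)
    )
    return {
        "meeting_overlap_count": len(events),
        "person_matrix": dict(matrix),
        "suppressed_events": len(events) - sum(matrix.values()),
    }
```

Reporting suppressed_events alongside the matrix is a design choice in the sketch: it lets a reviewer see that events were withheld for identity-quality reasons rather than absent.
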
7.2 Chilling delta

Needs:
- credible trigger event identity;
- stable target-speaker identity before and after trigger;
- enough pre/post speaking opportunity.

Invalid when:
- target speaker split across clusters;
- target identity unresolved;
- trigger event itself attribution-ambiguous;
- sparse exposure.

Fallback:
- suppress person-level chilling;
- optionally retain meeting-level post-trigger participation anomaly, clearly marked as non-attributed.

7.3 Floor-time Gini

Needs:
- reasonably complete speech activity detection;
- speaker segmentation not catastrophically broken.

Fragile to:
- severe speaker splitting (one person counted as many);
- severe merging (many counted as one);
- participant-count mismatch.

Fallback:
- keep as low-confidence meeting-level only if speech detection is decent;
- do not over-interpret at person level under split/merge suspicion.

7.4 Unanswered-question rate

Needs:
- correct identity of the asker;
- turn ordering integrity;
- transcript quality sufficient for question detection and response mapping.

Invalid when:
- asker identity unstable;
- transcript semantics weak;
- multilingual or code-switch regime without calibration;
- response speaker mapping ambiguous.

Fallback:
- suppress person-level rate;
- optionally retain raw count of unresolved question-like events for review.

7.5 Topic-credit ignored-turns

This is one of the most fragile detectors.

Needs:
- transcript semantics;
- attribution across proposer / restater / credited speaker;
- local topic continuity.

Invalid when:
- any involved speaker identity is unresolved or wrong;
- transcript quality weak;
- semantic similarity unreliable due to language mixing or ASR degradation.

Fallback:
- observation-only, or disable in red-quality meetings.

7.6 Agreement asymmetry

Also identity-fragile.

Needs:
- correct directionality between source speaker and shifting speakers;
- transcript semantics or stance logic;
- meeting-type and cultural calibration.

Invalid when:
- source identity unstable;
- responder identity unstable;
- language/calibration weak;
- participant count mismatch.

Fallback:
- suppress person-level directional inference.

7.7 Cross-meeting wrappers: continuity and baseline drift

These are the hardest to defend under identity instability because they assume durable person continuity.

Invalid when:
- canonical person linkage confidence is weak;
- same person’s meetings mix incompatible language regimes or meeting types without calibration;
- repeated split/merge events create artificial trend movement.

Fallback:
- keep within-meeting observational metrics only;
- do not generate longitudinal person or dyad claims.

======================================================================
8. Recommended gating model
======================================================================

Kashi should compute quality at three levels:
A. Meeting-level identity quality
B. Person-level identity quality
C. Metric-level eligibility

8.1 Meeting-level inputs
- transcript confidence distribution
- diarization confidence distribution
- overlap ratio
- unknown-speaker duration share
- split/merge suspicion flags
- participant-count mismatch (roster vs diarization clusters)
- language regime (single-language / mixed-language / code-switch)
- room-audio proportion
- meeting-type confidence

8.2 Person-level inputs
- percent of person-attributed speech backed by Tier 1/2 evidence
- percent unresolved
- percent disputed
- number of participant instances merged
- room-only attribution share
- longitudinal linkage confidence

8.3 Initial conservative pilot thresholds (suggested defaults, not truth)

Green
- unknown speaker duration < 5%
- participant-count mismatch = 0 or explainable
- overlap ambiguity < 5%
- no confirmed wrong-speaker incidents affecting scored events

Amber
- unknown speaker duration 5–15%
- moderate overlap ambiguity
- some unresolved participant mapping
- show evidence grade and suppress fragile metrics

Red
- unknown speaker duration > 15%
- confirmed merge/split affecting key speakers
- wrong-speaker disputes unresolved
- heavy mixed-language or code-switching without calibrated support
- room audio dominant without verified in-room identity
- only meeting-level observational telemetry allowed; no person-level review-worthy events

Critical rule:
Metric gating must be stricter than transcript availability. The existence of a transcript is not evidence that person-level inference is allowed.

======================================================================
9. Contestability and correction workflow
======================================================================

This cannot be a support email. It has to be a state machine.

9.1 User-visible dispute reasons

Allow at minimum:
- “That wasn’t me.”
- “Two people were merged.”
- “My speech was split into multiple speakers.”
- “Guest and employee were confused.”
- “I switched device / room.”
- “This was a mixed-language meeting.”
- “The room audio was bad.”
- “I was chair/facilitator.”
- “This meeting type is unusual.”

9.2 Correction state machine

PROPOSED -> UNDER_REVIEW -> ACCEPTED or REJECTED -> RECOMPUTED

Rules:
- raw utterance source remains immutable;
- identity-mapping decisions are versioned, not overwritten;
- recomputation of downstream metrics is automatic and audited;
- any accepted “wrong speaker” correction invalidates affected historical outputs until recompute finishes.

9.3 Audit expectations

For every displayed person-level signal, Kashi should be able to answer:
- which utterances were used;
- how those utterances were attributed;
- what confidence band the attribution had;
- whether any disputes touched them;
- whether recomputation occurred.

======================================================================
10. Critical product-language correction
======================================================================

Kashi should stop implying that speaker attribution is part of a clean structural substrate.

Better wording:
- “Kashi ingests transcript-linked meeting records including timestamps and speaker-attribution artefacts.”
- “Employer-facing analytics are computed from structural interaction metadata, conditional on input-quality and attribution-quality gates.”
- “Person-level outputs are suppressed when speaker identity confidence is insufficient.”
- “Detectors are deterministic conditional on accepted input mappings; identity resolution itself is evidence-weighted and contestable.”

Why this matters:
That wording is more honest, more defensible, and paradoxically stronger. You are not weakening the product by admitting this. You are removing an easy attack surface.

======================================================================
11. Recommended implementation plan for devs
======================================================================

P0: Must exist before serious pilot
1. Canonical identity schema
2. Distinct UNKNOWN vs WRONG states
3. Metric eligibility engine
4. Meeting/person quality scoring
5. Dispute/correction workflow
6. Recompute + audit trail

P1: Strong next layer
7. Meeting-type-aware gating
8. Language-regime tags in baseline logic
9. Room-audio specific handling
10. Manual correction tools for admins / reviewers / users

P2: Later, only with care
11. More advanced participant-instance reconciliation
12. Optional room voice-profile integrations
13. Better semantic detectors after identity layer is stable

What should not be prioritized early:
- clever cross-meeting voice linking from embeddings;
- fancy UI before identity auditability exists;
- more semantic detectors before current identity fragility is fenced.

======================================================================
12. Acceptance criteria
======================================================================

Engineering acceptance criteria
- Every utterance has a provenance chain.
- Speaker label raw string is never the only identity field stored.
- Unknown and wrong speaker states are distinct in schema, API, and UI.
- Person-level metrics cannot render when metric eligibility returns false.
- Longitudinal trends require canonical_person_id confidence over threshold.
- Every accepted correction triggers downstream recomputation.
- Audit log records old mapping, new mapping, actor, timestamp, and affected metrics.

Product acceptance criteria
- Users can see why a person-level metric is hidden or downgraded.
- Users can dispute identity assignment without escalating the broader case.
- Reviewers can distinguish “not enough evidence” from “no issue.”
- Meetings with bad identity quality degrade to observation-only, not fake certainty.

Pilot acceptance criteria
- track unknown-speaker share per meeting;
- track correction rate per 100 meetings;
- track percent of person-level outputs suppressed by quality gate;
- track reviewer agreement on whether suppression was appropriate;
- track time-to-recompute after accepted correction;
- track how often disputes reveal split/merge vs pure transcript wording problems.

======================================================================
13. Test pack Kashi should add immediately
======================================================================

At minimum, add synthetic + semi-synthetic regression cases for:

1. Same-name collision
   Two “Kenji” participants, one internal and one external guest.
   Expected: no silent merge.

2. Guest without immutable ID
   Transcript has “John,” roster has two possible Johns.
   Expected: unresolved pseudonym, not forced match.

3. Device switch mid-meeting
   Same user reconnects from phone.
   Expected: participant-instance split, person continuity only if supported.

4. Room audio without enrolled speaker tech
   Expected: no person-level attribution from room audio.

5. Room audio with enrolled speaker tech
   Expected: attribution allowed but provenance marked as room-voice-linked.

6. Overlap-heavy meeting
   Expected: interruption matrix suppressed if ambiguity too high.

7. One real speaker split into multiple diarization clusters
   Expected: person-level floor share and chilling invalidated until corrected.

8. Two real speakers merged into one cluster
   Expected: dyadic metrics suppressed.

9. Mixed-language / code-switch meeting
   Expected: semantic detectors suppressed or downgraded.

10. Early intermediate unknowns later resolved
    Expected: final mapping may upgrade, but only final accepted mapping feeds stable metrics.

======================================================================
14. Final judgment
======================================================================

The speaker-identity problem is not “just diarization.” It is the point where Kashi’s technical, epistemic, and procedural fairness stories either become serious or collapse.

The real standard is not “can we label who spoke in a demo?”

The real standard is:
Can Kashi say, for every person-level claim,
- how that person was resolved,
- how uncertain that resolution was,
- what happens when the resolution is challenged,
- and which metrics were suppressed because the system refused to bluff?

If the answer becomes yes, the product gets materially stronger. If the answer stays vague, speaker identity remains one of the cleanest ways to break the whole thesis.

======================================================================
Source notes
======================================================================

Internal project docs used
- Kashi - Progress & Project Overview (2026-04-21)
- Kashi Measurement-Science Research Memo (2026-04-21)
- Kashi Research Synthesis: Legal Defensibility, Procedural Fairness, and Governance Design (2026-04-21)
- Kashi cross-cultural / multilingual strategy memo (2026-04-21)

Current external technical references used
- Microsoft Support: Hide your identity in meeting captions and transcripts in Microsoft Teams
- Microsoft Support: Use Microsoft Teams Intelligent Speakers to identify in-room participants in a meeting transcription
- Microsoft Learn / Azure AI Speech: real-time diarization quickstart
- Zoom Support: Using smart name tags for voice in Zoom Rooms
- Google Meet Help: Use Transcripts with Google Meet
- Google Cloud Speech-to-Text docs: Detect different speakers in an audio recording
- Amazon Transcribe docs: Partitioning speakers (diarization)
- Interspeech 2024 DISPLACE challenge paper on overlap-heavy diarization