KASHI — INPUT-SUBSTRATE PERSPECTIVE
Technical research memo for developers
Date: 2026-04-21

Purpose
Turn the “input-substrate perspective” into a dev-facing technical specification and critique.
This is not pitch copy. This is the implementation-facing view of what enters the system, what must be normalized, what must be gated, and what must be suppressed when the substrate is weak.

==================================================
0. EXECUTIVE JUDGMENT
==================================================

The input substrate is not a boring upstream detail. It is the validity boundary of the whole product.

If the ingestion layer is sloppy, Kashi does not measure workplace asymmetry. It measures transcript failure, diarization drift, platform quirks, and mixed-language confusion.

The current internal materials already imply the right direction:
- Kashi starts from meeting transcripts, speaker attribution, and meeting metadata.
- Kashi wants deterministic structural analysis where possible.
- Kashi already recognizes L2 speakers, Japanese silence norms, chair role, and multilingual conditions as confounds.
- Kashi’s multilingual memo explicitly says transcript quality and language conditions need a first-class gate.

But the current project state still has one major technical contradiction:
- The deck says “patterns, not content, not affect,” “never transcribe for analysis,” and “none read meeting content.”
- At the same time, shipped / named detectors such as unanswered-question rate, topic-credit ignored-turns, and agreement-asymmetry already require transcript interpretation or semantic comparison.

That contradiction must be resolved in code and in product docs. Not later. Now.

There are only two honest technical options:

Option A — Strict structural MVP
Keep MVP genuinely structural-only.
Allowed detector family:
- speaking share / floor-time metrics
- interruption / overlap / truncation
- turn counts / turn durations
- response latency
- directed turn graph
- longitudinal drift
- simple chilling-delta if trigger definition remains structural

Remove or disable in MVP:
- unanswered-question rate
- topic-credit ignored-turns
- agreement-asymmetry
- anything depending on “substantive response,” similarity, stance shift, or semantic interpretation

Option B — Constrained hybrid pipeline
Admit that some detectors are hybrid, not purely structural.
Then build:
- structural detector lane
- transcript-semantic detector lane
- detector eligibility gating per lane
- per-detector confidence and abstention rules
- stronger platform/language restrictions

What is not acceptable: pretending semantic detectors are still “metadata only.” That is technically false and will get shredded by any serious reviewer.

Recommended path:
- MVP production lane = strict structural core
- experimental lane = constrained hybrid detectors behind explicit gating and caveats

==================================================
1. WHICH PLATFORMS SHOULD BE FIRST-CLASS?
==================================================

Current answer: Zoom, Microsoft Teams, Google Meet.
That is already consistent with the concept note and current progress deck.
But “supported” should not mean “one badge on a slide.” It should mean “documented input contract + tested parser + detector eligibility matrix.”

Recommended ranking for first-class production support:

Tier 1 — Microsoft Teams
Why:
- official docs explicitly state live transcription includes speaker names and timestamps
- official docs expose spoken-language selection and language-mismatch correction flow
- official docs expose participant notice and optional identity hiding in captions/transcripts
- official docs document original transcript vs live translated transcription behavior
Technical implication:
Teams is the cleanest first-class substrate for person-linked turn analysis because the platform contract around speaker names, timestamps, spoken-language handling, and user notice is the most explicit.

Tier 2 — Zoom
Why:
- official docs explicitly state cloud audio transcription generates VTT transcript files with timestamps
- transcripts can contain unknown speakers whose names are manually edited later
- supported languages for cloud audio transcripts are documented
- Zoom explicitly warns transcription quality depends on language and audio quality
Technical implication:
Zoom is usable, but speaker identity quality and transcript quality must be treated as more conditional than Teams. Do not conflate Zoom AI Companion language breadth with cloud-audio-transcript detector readiness. A language supported somewhere in AI Companion is not automatically detector-grade for Kashi.

Tier 3 — Google Meet
Why:
- official docs clearly document transcript availability, supported transcript languages, storage location, host/co-host control, default settings, and participant warning when auto-start is enabled
- official docs for “take notes for me” explicitly say only one spoken language at a time is supported and multiple languages in the same meeting are not supported
- however, Google’s official transcript docs are less explicit than Teams about speaker-attribution guarantees in the material reviewed here
Technical implication:
Meet is valid as a first-class platform for transcript ingestion, but should be treated as the most conservative platform for speaker-attribution-dependent detectors unless implementation testing proves otherwise.

Recommended engineering policy:
- first-class does not mean identical detector availability
- each platform gets its own detector-eligibility matrix
- every new platform/feature pair must pass a substrate validation checklist before it becomes production-grade

==================================================
2. WHAT RAW ARTIFACTS SHOULD KASHI INGEST?
==================================================

Minimum ingestion artifacts

A. Transcript-turn artifact
Required fields:
- source_platform
- source_meeting_id
- org_id / tenant_id
- transcript_segment_id
- speaker_source_id
- speaker_display_name if available
- start_timestamp_ms
- end_timestamp_ms
- raw_text
- transcript_language
- segment_confidence if platform provides it
- source_file_reference (VTT/SRT/TXT/JSON/etc.)

B. Speaker artifact
Required fields:
- speaker_source_id
- normalized_speaker_id (internal)
- display_name
- participant_role_if_known (host, co-host, organizer, attendee, external)
- identity_confidence
- diarization_confidence if available
- alias resolution history

C. Meeting artifact
Required fields:
- meeting_id
- org_id
- platform
- recording_id / transcript_id
- start_at
- end_at
- duration_minutes
- organizer_id
- host_ids
- participant_roster
- participant_count
- internal_vs_external flags
- meeting_title
- calendar_event_id if available
- meeting_type candidate
- transcript_enabled_by
- recording_enabled_by
- notice_mode (manual / automatic / unknown)

D. Language regime artifact
Required fields:
- configured_spoken_language
- detected_primary_language
- detected_secondary_languages
- single_language_flag
- mixed_language_flag
- code_switch_likely_flag
- language_detection_confidence
- locale_pack_candidate

E. Quality artifact
Required fields:
- transcript_available_flag
- transcript_confidence_band
- speaker_attribution_confidence_band
- overlap_quality_band
- audio_quality_proxy if available
- timestamp_integrity_flag
- parser_warning_codes
- eligibility_status_by_detector

F. Calendar / context artifact
Ingest only what helps normalization and permissions.
Useful fields:
- recurring_series_id
- invitee_count
- external_attendee_presence
- department / team scope if contractually allowed
- organizer org unit
- scheduled duration
- optional meeting labels from calendar or admin metadata

Do not expand ingestion just because you can. The point is detector validity, not data hoarding.

==================================================
3. CANONICAL NORMALIZED INPUT CONTRACT
==================================================

Kashi should not run detectors directly on raw platform exports. It needs a canonical normalized schema.

Recommended canonical unit: Turn

Turn {
  turn_id
  meeting_id
  speaker_id
  start_ms
  end_ms
  duration_ms
  raw_text
  normalized_text
  source_language
  text_confidence
  diarization_confidence
  overlap_before_ms
  overlap_after_ms
  interruption_candidate_flag
  turn_sequence_index
  parser_warning_codes[]
}

MeetingEnvelope {
  meeting_id
  org_id
  platform
  started_at
  ended_at
  duration_ms
  organizer_id
  participant_ids[]
  participant_count
  meeting_type_label
  meeting_type_confidence
  locale_pack
  language_regime
  quality_profile
  raw_source_refs[]
}

QualityProfile {
  transcript_confidence_band
  diarization_confidence_band
  overlap_quality_band
  language_regime_band
  parser_integrity_band
  detector_eligibility_map
}

Do not bury quality inside one hidden scalar. Quality must be explicit and queryable.

==================================================
4. PRE-ANALYSIS QUALITY GATES
==================================================

This should be an explicit gate pipeline, not a vague “confidence” factor at the end.

Gate 1 — Substrate presence gate
Questions:
- Is there a transcript at all?
- Are timestamps present?
- Are there speaker labels or at least stable speaker buckets?
- Is meeting duration plausible?
If fail:
- do not run person-linked detectors
- store ingestion failure state
- surface “analysis unavailable: substrate incomplete”

Gate 2 — Parser integrity gate
Questions:
- Did parsing create valid ordered turns?
- Are timestamps monotonic enough?
- Are there negative or zero durations at scale?
- Did encoding/language handling break text?
If fail:
- block all downstream analytics
- log parser defects
- send artifact to parser QA queue

Gate 3 — Speaker-attribution gate
Questions:
- Are speaker labels present consistently?
- Are there too many Unknown Speaker segments?
- Is diarization confidence below threshold?
- Is identity stitching across meetings stable enough for longitudinal analysis?
If weak:
- allow only meeting-level aggregate metrics if safe
- block dyadic / person-targeted detectors
- never run interruption-directionality or manager-to-target concentration logic on low speaker-confidence substrate

Gate 4 — Transcript-text gate
Questions:
- Is text confidence above threshold?
- Is transcript sparsity abnormal?
- Is overlap-heavy speech causing text corruption?
- Is the transcript language consistent enough for semantic interpretation?
If weak:
- structural-only detectors may still run if timestamps and speakers are good
- semantic or hybrid detectors must be suppressed

Gate 5 — Language regime gate
Questions:
- Is this single-language or mixed-language?
- Is code-switching likely?
- Does the platform officially support this spoken language for the relevant feature?
- Is this language supported only for note-taking but not transcript-grade analysis?
If mixed-language or unsupported:
- downgrade or suppress transcript-semantic detectors
- split baseline streams by language regime
- attach hard caveat to timing/latency interpretation

Gate 6 — Meeting-type gate
Questions:
- Is meeting type known with enough confidence?
- If unknown, is the fallback normalization safe?
- Is this a greylisted meeting class (incident bridge, exec review, training, client meeting) where some detectors should be observational only?
If unknown or unsupported:
- do not emit review-worthy events by default
- show observational metrics only

Gate 7 — Sample sufficiency gate
Questions:
- Is there enough comparable exposure?
- Is this person actually speaking enough to infer anything?
- Is this longitudinal signal based on comparable meeting classes and language regimes?
If insufficient:
- abstain
- do not force output
- show “insufficient comparable exposure” instead of a weak pseudo-score

==================================================
5. DETECTOR ELIGIBILITY MATRIX
==================================================

This is the most important implementation object.
Every detector needs explicit prerequisites.

A. Structural-only detectors

1) Speaking share / floor-time Gini
Requires:
- valid timestamps
- minimally reliable speaker attribution
Can survive:
- weak text quality
Cannot survive:
- speaker identity collapse

2) Overlap / truncation / intrusive interruption
Requires:
- valid turn boundaries
- strong enough diarization / overlap segmentation
Can survive:
- weak lexical transcript quality
Cannot survive:
- merged speakers, poor overlap segmentation

3) Response latency
Requires:
- valid turn sequence
- reliable timestamps
Needs caution:
- language regime
- meeting type
- culture / locale pack

4) Directed turn graph / reciprocity skeleton
Requires:
- speaker identity quality
- valid turn ordering
Does not require:
- semantic interpretation

5) Speaker baseline drift
Requires:
- stable person identity across meetings
- adequate comparable history
- meeting-type and language tags

B. Hybrid / semantic detectors

6) Unanswered-question rate
Hidden requirement:
- question detection
- "substantive response" determination
- N-turn window semantics
Meaning:
This is not purely structural.
Run only when:
- transcript quality high
- language regime single-language
- supported language
- semantic lane enabled
Otherwise:
- suppress

7) Topic-credit ignored-turns
Hidden requirement:
- semantic similarity or proposition matching
- attribution of later credit
Meaning:
This is definitely not purely structural.
Run only when:
- transcript quality high
- speaker attribution high
- single-language or very well-supported same-language segments
- semantic model calibrated for that language
Otherwise:
- suppress

8) Agreement-asymmetry / position shift logic
Hidden requirement:
- stance or position movement inference
- semantic interpretation across turns
Meaning:
This is the weakest candidate for early production.
Recommendation:
- experimental lane only
- no production review-worthy events from this detector until validated

Engineering rule:
Every detector returns:
- score
- evidence grade
- reason codes
- abstain flag
- suppression reason if blocked

No detector should silently degrade from "strong evidence" to "vibes."

==================================================
6. MIXED-LANGUAGE, CODE-SWITCHING, AND L2 HANDLING
==================================================

This cannot be an afterthought. It must be a first-class branch in the pipeline.

Required variables
- primary spoken language
- secondary spoken language(s)
- mixed-language flag
- code-switch-likely flag
- participant-level language heterogeneity if inferable safely
- locale pack
- language-regime-specific baseline key

Recommended policy

Single-language + officially supported + good quality
- full structural lane allowed
- hybrid lane allowed only for detectors explicitly validated in that language

Single-language + officially unsupported or low-confidence language handling
- structural lane only if timestamps/speakers are solid
- hybrid lane blocked

Mixed-language or code-switch-heavy
- meeting-level structural metrics allowed only if speaker/timing quality survives
- dyadic semantic detectors blocked
- response-latency interpretation downgraded
- explicit caveat attached
- do not merge these meetings into the same baseline stream as clean single-language meetings

L2-specific caution
Do not treat slower response, lower floor time, or longer pauses as direct evidence of suppression.
These features must flow through:
- self-history baseline
- meeting-type baseline
- locale pack
- language-regime tag

Important baseline rule:
A speaker’s Japanese-only meetings and English-heavy meetings must not automatically share one undifferentiated baseline.

==================================================
7. PLATFORM-SPECIFIC IMPLEMENTATION NOTES
==================================================

Microsoft Teams
- Treat as strongest first-class source for turn-level analysis.
- Ingest spoken-language setting if available.
- Capture whether language mismatch correction happened.
- Preserve participant identity-hidden state if surfaced.
- For translated transcription, treat original transcript as canonical; do not run detectors on translated text.
- Because language can be updated during a meeting, version the meeting language regime over time if exposed by the connector.

Zoom
- Separate cloud audio transcript support from AI Companion marketing/support pages.
- Canonical artifact is transcript/VTT + timestamps.
- Unknown speakers must remain explicit until resolved; do not auto-guess speaker identity across many meetings without strong evidence.
- Capture selected original language if changed/regenerated.
- Add stronger audio-quality and language-quality caveats because Zoom itself warns transcription quality varies.

Google Meet
- Capture transcript-start state, host/co-host control state, and whether transcription auto-start was enabled.
- Treat “take notes for me” as separate from transcript artifact. Meeting notes are not canonical detector input.
- Because Google explicitly says the notes feature supports one language at a time and not multiple languages in the same meeting, mixed-language meetings should be downgraded aggressively.
- Until speaker-attribution guarantees are validated in implementation, treat Meet as conservative for speaker-dependent directional detectors.

==================================================
8. STORAGE / SECURITY / AUDIT IMPLICATIONS OF INPUT SUBSTRATE
==================================================

The substrate layer is also where privacy failure happens.

Recommended storage split

Raw layer (isolated, short-lived)
- original transcript files
- raw turn text
- raw speaker labels
- parser outputs
- regeneration / correction artifacts
- access very restricted

Analytics layer (medium retention)
- turn metrics
- aggregated meeting metrics
- detector inputs stripped to necessity
- confidence bands

Review-worthy event layer
- bounded context window only
- trigger explanation
- evidence grade
- traceable references
- no unnecessary full-meeting exposure

Audit layer
- ingestion events
- parser version
- detector version
- suppression reasons
- who accessed what
- who changed language / speaker mappings if editable

Critical anti-retaliation rule
Private awareness actions must not create employer-visible side effects.
That includes:
- opening one’s own pattern page
- building a private draft
- creating private evidence material
- reviewing one’s own flagged events

Security logging can exist. Business analytics visibility of those actions must not.

==================================================
9. RECOMMENDED CONFIDENCE / ABSTENTION MODEL
==================================================

Do not use one opaque composite confidence. Use explicit bands.

Recommended evidence dimensions
- substrate_quality
- speaker_identity_quality
- text_quality
- language_regime_stability
- meeting_type_confidence
- sample_sufficiency

Recommended output bands
- A = detector-grade
- B = usable with caveat
- C = observational only
- D = suppressed / abstain

Example rule
Topic-credit ignored-turns:
- requires A/B text quality
- requires A speaker quality
- requires single-language stable regime
- requires validated semantic model for that language
If any fail -> D (suppressed)

Example rule
Speaking share:
- requires B or better timestamp quality
- requires B or better speaker quality
- text quality can be D
If timestamps or speaker quality fail -> D
Else A/B/C depending on sample and meeting type

UI / API requirement
Every surfaced output must carry:
- evidence band
- reason codes
- caveat flags
- abstention explanation when blocked

==================================================
10. WHAT SHOULD HAPPEN WHEN QUALITY IS WEAK?
==================================================

Not everything should fail the same way.

Case 1 — transcript weak, speaker strong
Allowed:
- floor time
- interruption if overlap segmentation good
- turn graph
Blocked:
- question-response
- topic credit
- agreement asymmetry

Case 2 — speaker weak, transcript strong
Allowed:
- maybe very coarse meeting-level text observations in experimental lane only
Blocked:
- person-linked and dyadic detectors
- manager-target concentration logic

Case 3 — mixed-language meeting
Allowed:
- coarse observational metrics with strong caveats
Blocked:
- semantic attribution detectors
- naive latency-based social interpretation

Case 4 — unsupported meeting type
Allowed:
- descriptive metrics only
Blocked:
- review-worthy events by default

Case 5 — sparse sample
Allowed:
- “not enough comparable exposure yet”
Blocked:
- person-level risk narrative

Absolute rule:
Low-confidence outputs should be suppressed, downgraded, or explanation-labeled. They must not be shown as if they are equally trustworthy.
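
Illustrative sketch (not an existing Kashi implementation)
The two example rules in section 9 and Case 1 above could be expressed roughly as follows. All names here (Band, Substrate, DetectorResult, the reason-code strings) are hypothetical and chosen for illustration. The sketch reads "requires B" as "B or better", folds sample sufficiency and meeting type into a single assumed sample_band input, and assumes that a topic-credit result which passes every prerequisite takes the weaker of its text and sample bands, which the memo does not specify.

```python
from dataclasses import dataclass
from enum import IntEnum


class Band(IntEnum):
    """Evidence bands; a lower value means stronger evidence."""
    A = 1  # detector-grade
    B = 2  # usable with caveat
    C = 3  # observational only
    D = 4  # suppressed / abstain


@dataclass(frozen=True)
class Substrate:
    """Hypothetical per-meeting quality inputs (field names are illustrative)."""
    timestamp_quality: Band
    speaker_quality: Band
    text_quality: Band
    single_language_stable: bool
    semantic_model_validated: bool
    sample_band: Band  # assumed to fold in sample size and meeting type


@dataclass(frozen=True)
class DetectorResult:
    band: Band
    reason_codes: tuple = ()

    @property
    def abstain(self) -> bool:
        return self.band == Band.D


def speaking_share(s: Substrate) -> DetectorResult:
    """Structural detector: text quality is deliberately not consulted."""
    reasons = []
    if s.timestamp_quality > Band.B:
        reasons.append("TIMESTAMP_QUALITY_BELOW_B")
    if s.speaker_quality > Band.B:
        reasons.append("SPEAKER_QUALITY_BELOW_B")
    if s.sample_band == Band.D:
        reasons.append("INSUFFICIENT_COMPARABLE_EXPOSURE")
    if reasons:
        return DetectorResult(Band.D, tuple(reasons))
    return DetectorResult(s.sample_band)


def topic_credit_ignored_turns(s: Substrate) -> DetectorResult:
    """Hybrid detector: any failed prerequisite suppresses the output."""
    reasons = []
    if s.text_quality > Band.B:
        reasons.append("TEXT_QUALITY_BELOW_B")
    if s.speaker_quality != Band.A:
        reasons.append("SPEAKER_QUALITY_BELOW_A")
    if not s.single_language_stable:
        reasons.append("LANGUAGE_REGIME_NOT_SINGLE_STABLE")
    if not s.semantic_model_validated:
        reasons.append("SEMANTIC_MODEL_NOT_VALIDATED")
    if s.sample_band == Band.D:
        reasons.append("INSUFFICIENT_COMPARABLE_EXPOSURE")
    if reasons:
        return DetectorResult(Band.D, tuple(reasons))
    # Assumption: once all prerequisites pass, take the weaker of the
    # text band and the sample band.
    return DetectorResult(max(s.text_quality, s.sample_band))


# Case 1 from section 10: transcript weak, speaker strong.
weak_text = Substrate(Band.A, Band.A, Band.D, True, True, Band.B)
print(speaking_share(weak_text))
print(topic_credit_ignored_turns(weak_text))
```

With this input the speaking-share rule still produces a band B result, while the topic-credit rule abstains and carries a machine-readable reason code. The point of returning reason codes with every suppression is that the UI / API requirement in section 9 can be met without a second lookup.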

==================================================
11. TECHNICAL ACCEPTANCE CRITERIA FOR INPUT SUBSTRATE
==================================================

A. Schema / ingestion
- System supports normalized ingestion for Teams, Zoom, and Meet.
- Every meeting record stores platform, transcript artifact refs, speaker refs, language regime, and quality profile.
- Parser failures are explicit states, not silent drops.

B. Quality gating
- No detector runs before substrate gating completes.
- Detector eligibility is stored per meeting and per detector.
- Mixed-language, low-diarization, and low-text-confidence states are first-class variables.

C. Detector discipline
- Structural-only detectors and hybrid detectors are separated in code and docs.
- Hybrid detectors cannot run unless semantic-lane prerequisites are met.
- Suppression reasons are machine-readable and audit-loggable.

D. Baselines
- Baselines are segmented by meeting type and language regime.
- Mixed-language meetings do not contaminate clean single-language baseline streams.
- Sparse data triggers abstention rather than fabricated certainty.

E. Privacy / governance
- Raw transcripts are isolated from normal product browsing.
- Private employee awareness actions are not surfaced to employer-side analytics.
- Audit logs exist for ingestion, mapping edits, and detector execution.

F. QA / validation
- Test pack includes:
  - clean single-language healthy meeting
  - overlap-heavy low-quality audio meeting
  - low-diarization meeting
  - multilingual / code-switch meeting
  - facilitator-heavy meeting
  - incident bridge
  - standup
  - sparse-data meeting
- Each test must assert expected detector suppression / downgrade behavior, not only positive detection.

==================================================
12. RECOMMENDED BUILD ORDER
==================================================

Phase 1 — Normalize the substrate
- canonical schema
- platform parsers
- quality profile object
- platform/language support registry

Phase 2 — Build hard gates
- substrate presence gate
- parser integrity gate
- speaker-quality gate
- text-quality gate
- language-regime gate
- meeting-type gate
- sample sufficiency gate

Phase 3 — Release strict structural lane
- speaking share
- interruption / overlap / truncation
- turn graph
- response latency with caveats
- longitudinal drift
- evidence bands + abstention

Phase 4 — Add normalization hardening
- meeting-type priors
- locale packs
- language-regime baseline split
- confound surfaces

Phase 5 — Experimental semantic lane
- unanswered-question rate
- topic-credit ignored-turns
- maybe agreement asymmetry only after serious validation

==================================================
13. BRUTAL BOTTOM LINE
==================================================

From the dev perspective, the input-substrate problem is this:

Kashi is not building “meeting analytics.”
Kashi is building a detector system whose outputs are only defensible if every detector is conditional on substrate quality.

So the correct engineering posture is:
- platform-aware
- language-aware
- meeting-type-aware
- confidence-explicit
- abstention-friendly
- suppression-by-default when the substrate is weak

If the team does not build that layer, the product will overstate certainty exactly where the data is dirtiest.
That is not a minor bug. That is the core failure mode.

==================================================
14. SOURCE BASIS USED FOR THIS MEMO
==================================================

Internal Kashi materials
- Kashi — Progress & Project Overview (2026-04-21)
- Transparency That Drives Institutional Accountability / meeting_governance_ai_concept_note
- Kashi cross-cultural / multilingual strategy memo
- Kashi Measurement-Science Research Memo

Official platform documentation checked
- Microsoft Support: View live transcription in Microsoft Teams meetings
- Microsoft Support: Record a meeting in Microsoft Teams
- Microsoft Support: Hide your identity in meeting captions and transcripts in Microsoft Teams
- Google Meet Help: Use transcripts with Google Meet
- Google Meet Help: Take notes for me in Google Meet
- Google Meet Help: Learn about Speech Translation
- Zoom Support: Using audio transcription for cloud recordings
- Zoom Support / Zoom docs: AI Companion language support and data-handling materials

Note on source handling
- This memo uses official platform docs for current platform capability claims.
- It uses internal Kashi memos for product critique, gating doctrine, and architectural implications.
- Where a platform’s official docs were weaker or less explicit than those of other platforms (especially around some Google Meet speaker-attribution details), the recommendation is intentionally conservative rather than assumptive.