KASHI — INPUT-SUBSTRATE PERSPECTIVE
Technical research memo for developers
Date: 2026-04-21

Purpose
Turn the “input-substrate perspective” into a dev-facing technical specification and critique.
This is not pitch copy. This is the implementation-facing view of what enters the system, what must be normalized, what must be gated, and what must be suppressed when the substrate is weak.

==================================================
0. EXECUTIVE JUDGMENT
==================================================

The input substrate is not a boring upstream detail. It is the validity boundary of the whole product.
If the ingestion layer is sloppy, Kashi does not measure workplace asymmetry. It measures transcript failure, diarization drift, platform quirks, and mixed-language confusion.

The current internal materials already imply the right direction:
- Kashi starts from meeting transcripts, speaker attribution, and meeting metadata.
- Kashi wants deterministic structural analysis where possible.
- Kashi already recognizes L2 speakers, Japanese silence norms, chair role, and multilingual conditions as confounds.
- Kashi’s multilingual memo explicitly says transcript quality and language conditions need a first-class gate.

But the current project state still has one major technical contradiction:
- The deck says “patterns, not content, not affect,” “never transcribe for analysis,” and “none read meeting content.”
- At the same time, shipped / named detectors such as unanswered-question rate, topic-credit ignored-turns, and agreement-asymmetry already require transcript interpretation or semantic comparison.

That contradiction must be resolved in code and in product docs.
Not later. Now.

There are only two honest technical options:

Option A — Strict structural MVP
Keep MVP genuinely structural-only.
Allowed detector family:
- speaking share / floor-time metrics
- interruption / overlap / truncation
- turn counts / turn durations
- response latency
- directed turn graph
- longitudinal drift
- simple chilling-delta if trigger definition remains structural

Remove or disable in MVP:
- unanswered-question rate
- topic-credit ignored-turns
- agreement-asymmetry
- anything depending on “substantive response,” similarity, stance shift, or semantic interpretation

Option B — Constrained hybrid pipeline
Admit that some detectors are hybrid, not purely structural.
Then build:
- structural detector lane
- transcript-semantic detector lane
- detector eligibility gating per lane
- per-detector confidence and abstention rules
- stronger platform/language restrictions

What is not acceptable:
pretending semantic detectors are still “metadata only.”
That is technically false and will get shredded by any serious reviewer.

Recommended path:
- MVP production lane = strict structural core
- experimental lane = constrained hybrid detectors behind explicit gating and caveats
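
A minimal sketch of the recommended lane split, not the shipped registry. The detector keys, the Lane enum, and the experimental_lane_enabled flag are hypothetical names chosen for illustration; lane assignments follow the Option A / Option B lists above.

```python
from enum import Enum


class Lane(Enum):
    STRUCTURAL = "structural"
    HYBRID = "hybrid"


# Hypothetical detector keys. Lane assignment mirrors the Option A / Option B lists.
DETECTOR_LANES = {
    "speaking_share": Lane.STRUCTURAL,
    "interruption_overlap_truncation": Lane.STRUCTURAL,
    "turn_counts_durations": Lane.STRUCTURAL,
    "response_latency": Lane.STRUCTURAL,
    "directed_turn_graph": Lane.STRUCTURAL,
    "longitudinal_drift": Lane.STRUCTURAL,
    "unanswered_question_rate": Lane.HYBRID,
    "topic_credit_ignored_turns": Lane.HYBRID,
    "agreement_asymmetry": Lane.HYBRID,
}


def enabled_detectors(experimental_lane_enabled: bool) -> list[str]:
    """Structural detectors always qualify; hybrid ones need the explicit flag."""
    return sorted(
        name
        for name, lane in DETECTOR_LANES.items()
        if lane is Lane.STRUCTURAL or experimental_lane_enabled
    )
```

The point of keeping the lane in one table is that product docs and code can be checked against the same source: a detector cannot be described as structural while registered as hybrid.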

==================================================
1. WHICH PLATFORMS SHOULD BE FIRST-CLASS?
==================================================

Current answer: Zoom, Microsoft Teams, Google Meet.
That is already consistent with the concept note and current progress deck.
But “supported” should not mean “one badge on a slide.” It should mean “documented input contract + tested parser + detector eligibility matrix.”

Recommended ranking for first-class production support:

Tier 1 — Microsoft Teams
Why:
- official docs explicitly state live transcription includes speaker names and timestamps
- official docs expose spoken-language selection and language-mismatch correction flow
- official docs expose participant notice and optional identity hiding in captions/transcripts
- official docs document original transcript vs live translated transcription behavior

Technical implication:
Teams is the cleanest first-class substrate for person-linked turn analysis because the platform contract around speaker names, timestamps, spoken-language handling, and user notice is the most explicit.

Tier 2 — Zoom
Why:
- official docs explicitly state cloud audio transcription generates VTT transcript files with timestamps
- transcripts can contain unknown speakers whose names are manually edited later
- supported languages for cloud audio transcripts are documented
- Zoom explicitly warns transcription quality depends on language and audio quality

Technical implication:
Zoom is usable, but speaker identity quality and transcript quality must be treated as more conditional than Teams.
Do not conflate Zoom AI Companion language breadth with cloud-audio-transcript detector readiness.
A language supported somewhere in AI Companion is not automatically detector-grade for Kashi.

Tier 3 — Google Meet
Why:
- official docs clearly document transcript availability, supported transcript languages, storage location, host/co-host control, default settings, and participant warning when auto-start is enabled
- official docs for “take notes for me” explicitly say only one spoken language at a time is supported and multiple languages in the same meeting are not supported
- however, Google’s official transcript docs are less explicit than Teams about speaker-attribution guarantees in the material reviewed here

Technical implication:
Meet is valid as a first-class platform for transcript ingestion, but should be treated as the most conservative platform for speaker-attribution-dependent detectors unless implementation testing proves otherwise.

Recommended engineering policy:
- first-class does not mean identical detector availability
- each platform gets its own detector-eligibility matrix
- every new platform/feature pair must pass a substrate validation checklist before it becomes production-grade
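
A minimal sketch of a default-deny registry for platform/detector pairs, assuming the three checklist items named earlier in this section (documented input contract, tested parser, reviewed eligibility matrix). SubstrateChecklist and PlatformRegistry are hypothetical names, and no real validation results are encoded here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubstrateChecklist:
    # All three default to False so an unreviewed pair is never production-grade.
    input_contract_documented: bool = False
    parser_tested: bool = False
    eligibility_reviewed: bool = False

    def passed(self) -> bool:
        return (
            self.input_contract_documented
            and self.parser_tested
            and self.eligibility_reviewed
        )


class PlatformRegistry:
    """Maps (platform, detector) pairs to their validation checklist."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], SubstrateChecklist] = {}

    def record(self, platform: str, detector: str, checklist: SubstrateChecklist) -> None:
        self._entries[(platform, detector)] = checklist

    def is_production_grade(self, platform: str, detector: str) -> bool:
        # Unknown pairs are denied rather than assumed equivalent to another platform.
        entry = self._entries.get((platform, detector))
        return entry is not None and entry.passed()
```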

==================================================
2. WHAT RAW ARTIFACTS SHOULD KASHI INGEST?
==================================================

Minimum ingestion artifacts

A. Transcript-turn artifact
Required fields:
- source_platform
- source_meeting_id
- org_id / tenant_id
- transcript_segment_id
- speaker_source_id
- speaker_display_name if available
- start_timestamp_ms
- end_timestamp_ms
- raw_text
- transcript_language
- segment_confidence if platform provides it
- source_file_reference (VTT/SRT/TXT/JSON/etc.)

B. Speaker artifact
Required fields:
- speaker_source_id
- normalized_speaker_id (internal)
- display_name
- participant_role_if_known (host, co-host, organizer, attendee, external)
- identity_confidence
- diarization_confidence if available
- alias_resolution_history

C. Meeting artifact
Required fields:
- meeting_id
- org_id
- platform
- recording_id / transcript_id
- start_at
- end_at
- duration_minutes
- organizer_id
- host_ids
- participant_roster
- participant_count
- internal_vs_external_flags
- meeting_title
- calendar_event_id if available
- meeting_type_candidate
- transcript_enabled_by
- recording_enabled_by
- notice_mode (manual / automatic / unknown)

D. Language regime artifact
Required fields:
- configured_spoken_language
- detected_primary_language
- detected_secondary_languages
- single_language_flag
- mixed_language_flag
- code_switch_likely_flag
- language_detection_confidence
- locale_pack_candidate

E. Quality artifact
Required fields:
- transcript_available_flag
- transcript_confidence_band
- speaker_attribution_confidence_band
- overlap_quality_band
- audio_quality_proxy if available
- timestamp_integrity_flag
- parser_warning_codes
- eligibility_status_by_detector

F. Calendar / context artifact
Ingest only what helps normalization and permissions.
Useful fields:
- recurring_series_id
- invitee_count
- external_attendee_presence
- department / team scope if contractually allowed
- organizer org unit
- scheduled duration
- optional meeting labels from calendar or admin metadata

Do not expand ingestion just because you can.
The point is detector validity, not data hoarding.
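
A minimal sketch of turning a VTT export into transcript-turn records. It assumes cue payloads follow a "Speaker Name: text" convention, which must be verified against each platform's real exports. TranscriptSegment, parse_vtt, and the warning codes are hypothetical names, and only a subset of the required fields is shown.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm"; the hours part is optional in WebVTT.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_ms(h, m, s, ms) -> int:
    return ((int(h or 0) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


@dataclass
class TranscriptSegment:
    transcript_segment_id: str
    speaker_display_name: Optional[str]
    start_timestamp_ms: int
    end_timestamp_ms: int
    raw_text: str
    parser_warning_codes: list[str] = field(default_factory=list)


def parse_vtt(vtt_text: str) -> list[TranscriptSegment]:
    segments: list[TranscriptSegment] = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        idx = next((i for i, ln in enumerate(lines) if CUE_TIMING.search(ln)), None)
        if idx is None:
            continue  # "WEBVTT" header, NOTE blocks, anything without a cue timing
        groups = CUE_TIMING.search(lines[idx]).groups()
        start, end = _to_ms(*groups[:4]), _to_ms(*groups[4:])
        payload = " ".join(lines[idx + 1:])
        # Assumed convention: "Speaker Name: text". Text that itself contains
        # ": " but has no speaker label would be mis-split by this heuristic.
        speaker, sep, text = payload.partition(": ")
        warnings: list[str] = []
        if not sep:
            speaker, text = None, payload
            warnings.append("SPEAKER_LABEL_MISSING")
        if end <= start:
            warnings.append("NON_POSITIVE_DURATION")
        segments.append(
            TranscriptSegment(str(len(segments) + 1), speaker, start, end, text, warnings)
        )
    return segments
```

Missing speaker labels and bad durations are recorded as warning codes on the segment rather than dropped, so the later quality gates can see them.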

==================================================
3. CANONICAL NORMALIZED INPUT CONTRACT
==================================================

Kashi should not run detectors directly on raw platform exports.
It needs a canonical normalized schema.

Recommended canonical unit: Turn

Turn {
  turn_id
  meeting_id
  speaker_id
  start_ms
  end_ms
  duration_ms
  raw_text
  normalized_text
  source_language
  text_confidence
  diarization_confidence
  overlap_before_ms
  overlap_after_ms
  interruption_candidate_flag
  turn_sequence_index
  parser_warning_codes[]
}

MeetingEnvelope {
  meeting_id
  org_id
  platform
  started_at
  ended_at
  duration_ms
  organizer_id
  participant_ids[]
  participant_count
  meeting_type_label
  meeting_type_confidence
  locale_pack
  language_regime
  quality_profile
  raw_source_refs[]
}

QualityProfile {
  transcript_confidence_band
  diarization_confidence_band
  overlap_quality_band
  language_regime_band
  parser_integrity_band
  detector_eligibility_map
}

Do not bury quality inside one hidden scalar.
Quality must be explicit and queryable.
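
A minimal sketch of the Turn contract as a typed record, plus one way the derived fields could be filled in. The overlap and interruption-candidate definitions here are assumptions (overlap measured against the immediately preceding turn only), not a settled spec; finalize_turns is a hypothetical helper, and text and confidence fields are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    turn_id: str
    meeting_id: str
    speaker_id: str
    start_ms: int
    end_ms: int
    raw_text: str
    duration_ms: int = 0
    overlap_before_ms: int = 0
    overlap_after_ms: int = 0
    interruption_candidate_flag: bool = False
    turn_sequence_index: int = 0
    parser_warning_codes: list[str] = field(default_factory=list)


def finalize_turns(turns: list[Turn]) -> list[Turn]:
    """Order turns and fill derived fields in place. Returns the ordered list."""
    ordered = sorted(turns, key=lambda t: (t.start_ms, t.end_ms))
    for i, turn in enumerate(ordered):
        turn.turn_sequence_index = i
        turn.duration_ms = turn.end_ms - turn.start_ms
        if turn.duration_ms <= 0:
            # Keep the defect visible for the parser integrity gate.
            turn.parser_warning_codes.append("NON_POSITIVE_DURATION")
        if i > 0:
            prev = ordered[i - 1]
            overlap = max(0, prev.end_ms - turn.start_ms)
            turn.overlap_before_ms = overlap
            prev.overlap_after_ms = overlap
            # Assumed definition: starts before a different speaker has finished.
            turn.interruption_candidate_flag = (
                overlap > 0 and prev.speaker_id != turn.speaker_id
            )
    return ordered
```

A long turn that spans several later turns is not handled by the adjacent-pair comparison; a production version would need interval logic.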

==================================================
4. PRE-ANALYSIS QUALITY GATES
==================================================

This should be an explicit gate pipeline, not a vague “confidence” factor at the end.

Gate 1 — Substrate presence gate
Questions:
- Is there a transcript at all?
- Are timestamps present?
- Are there speaker labels or at least stable speaker buckets?
- Is meeting duration plausible?

If fail:
- do not run person-linked detectors
- store ingestion failure state
- surface “analysis unavailable: substrate incomplete”

Gate 2 — Parser integrity gate
Questions:
- Did parsing create valid ordered turns?
- Are segment start timestamps monotonic, within a defined tolerance for out-of-order segments?
- Are there negative or zero durations at scale?
- Did encoding/language handling break text?

If fail:
- block all downstream analytics
- log parser defects
- send artifact to parser QA queue

Gate 3 — Speaker-attribution gate
Questions:
- Are speaker labels present consistently?
- Are there too many Unknown Speaker segments?
- Is diarization confidence below threshold?
- Is identity stitching across meetings stable enough for longitudinal analysis?

If weak:
- allow only meeting-level aggregate metrics if safe
- block dyadic / person-targeted detectors
- never run interruption-directionality or manager-to-target concentration logic on a substrate with low speaker confidence

Gate 4 — Transcript-text gate
Questions:
- Is text confidence above threshold?
- Is transcript sparsity abnormal?
- Is overlap-heavy speech causing text corruption?
- Is the transcript language consistent enough for semantic interpretation?

If weak:
- structural-only detectors may still run if timestamps and speakers are good
- semantic or hybrid detectors must be suppressed

Gate 5 — Language regime gate
Questions:
- Is this single-language or mixed-language?
- Is code-switching likely?
- Does the platform officially support this spoken language for the relevant feature?
- Is this language supported only for note-taking, and not for transcript-grade analysis?

If mixed-language or unsupported:
- downgrade or suppress transcript-semantic detectors
- split baseline streams by language regime
- attach hard caveat to timing/latency interpretation

Gate 6 — Meeting-type gate
Questions:
- Is meeting type known with enough confidence?
- If unknown, is the fallback normalization safe?
- Is this a greylisted meeting class (incident bridge, exec review, training, client meeting) where some detectors should be observational only?

If unknown or unsupported:
- do not emit review-worthy events by default
- show observational metrics only

Gate 7 — Sample sufficiency gate
Questions:
- Is there enough comparable exposure?
- Is this person actually speaking enough to infer anything?
- Is this longitudinal signal based on comparable meeting classes and language regimes?

If insufficient:
- abstain
- do not force output
- show “insufficient comparable exposure” instead of a weak pseudo-score
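
A minimal sketch of the seven gates as an explicit, ordered pipeline. It assumes each gate's threshold decision has already been reduced to a boolean flag upstream; the flag names, detector-family names, and reason codes are hypothetical, and the gate-to-family mapping is a simplification of the "If fail / If weak" bullets above.

```python
from typing import Callable, NamedTuple

ALL, PERSON_LINKED, SEMANTIC, REVIEW_EVENTS = (
    "all", "person_linked", "semantic", "review_events",
)


def all_flags(*keys: str) -> Callable[[dict], bool]:
    # Missing flags count as failures: default deny.
    return lambda q: all(bool(q.get(k, False)) for k in keys)


class Gate(NamedTuple):
    name: str
    passes: Callable[[dict], bool]
    blocks_on_fail: frozenset
    reason_code: str


GATES = [
    Gate("substrate_presence",
         all_flags("has_transcript", "has_timestamps", "has_speaker_labels"),
         frozenset({PERSON_LINKED}), "SUBSTRATE_INCOMPLETE"),
    Gate("parser_integrity", all_flags("parser_ok"),
         frozenset({ALL}), "PARSER_DEFECT"),
    Gate("speaker_attribution", all_flags("speaker_attribution_ok"),
         frozenset({PERSON_LINKED}), "SPEAKER_ATTRIBUTION_WEAK"),
    Gate("transcript_text", all_flags("text_ok"),
         frozenset({SEMANTIC}), "TEXT_QUALITY_WEAK"),
    Gate("language_regime", all_flags("single_supported_language"),
         frozenset({SEMANTIC}), "LANGUAGE_REGIME_UNSUPPORTED"),
    Gate("meeting_type", all_flags("meeting_type_known"),
         frozenset({REVIEW_EVENTS}), "MEETING_TYPE_UNKNOWN"),
    Gate("sample_sufficiency", all_flags("sample_sufficient"),
         frozenset({PERSON_LINKED}), "INSUFFICIENT_COMPARABLE_EXPOSURE"),
]


def run_gates(quality: dict) -> dict[str, list[str]]:
    """Return blocked detector families mapped to the reason codes that blocked them."""
    blocked: dict[str, list[str]] = {}
    for gate in GATES:
        if not gate.passes(quality):
            for family in gate.blocks_on_fail:
                blocked.setdefault(family, []).append(gate.reason_code)
    return blocked


def suppression_reasons(family: str, blocked: dict[str, list[str]]) -> list[str]:
    """An empty list means the family may run."""
    return blocked.get(ALL, []) + blocked.get(family, [])
```

Every gate is evaluated even after an earlier one fails, so the stored result lists all reasons a family was blocked rather than only the first.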

==================================================
5. DETECTOR ELIGIBILITY MATRIX
==================================================

This is the most important implementation object.
Every detector needs explicit prerequisites.

A. Structural-only detectors

1) Speaking share / floor-time Gini
Requires:
- valid timestamps
- minimally reliable speaker attribution
Can survive:
- weak text quality
Cannot survive:
- speaker identity collapse

2) Overlap / truncation / intrusive interruption
Requires:
- valid turn boundaries
- strong enough diarization / overlap segmentation
Can survive:
- weak lexical transcript quality
Cannot survive:
- merged speakers, poor overlap segmentation

3) Response latency
Requires:
- valid turn sequence
- reliable timestamps
Needs caution:
- language regime
- meeting type
- culture / locale pack

4) Directed turn graph / reciprocity skeleton
Requires:
- speaker identity quality
- valid turn ordering
Does not require:
- semantic interpretation

5) Speaker baseline drift
Requires:
- stable person identity across meetings
- adequate comparable history
- meeting-type and language tags
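
A minimal sketch of detector 1, showing that floor-time Gini needs only speaker IDs and timestamps, never text. The tuple input shape and function names are hypothetical. Passing the roster is an assumption worth keeping: silent participants must count as zero, or the coefficient understates inequality.

```python
from collections import defaultdict
from typing import Iterable, Optional


def floor_time_by_speaker(
    turns: Iterable[tuple[str, int, int]], roster: Iterable[str] = ()
) -> dict[str, int]:
    """turns are (speaker_id, start_ms, end_ms). Overlapping speech is counted for both speakers."""
    totals: dict[str, int] = defaultdict(int)
    for speaker_id in roster:
        totals[speaker_id] += 0  # make sure silent participants appear with zero
    for speaker_id, start_ms, end_ms in turns:
        if end_ms > start_ms:  # ignore zero or negative durations
            totals[speaker_id] += end_ms - start_ms
    return dict(totals)


def gini(values: Iterable[float]) -> Optional[float]:
    """Gini coefficient of non-negative values. None when there is nothing to measure."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return None  # abstain instead of reporting a fake zero
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

With n speakers, the value runs from 0 (equal floor time) to (n - 1) / n (one speaker holds the floor throughout), so comparisons across meetings of different sizes need care.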

B. Hybrid / semantic detectors

6) Unanswered-question rate
Hidden requirement:
- question detection
- "substantive response" determination
- N-turn window semantics
Meaning:
This is not purely structural.
Run only when:
- transcript quality high
- language regime single-language
- supported language
- semantic lane enabled
Otherwise:
- suppress

7) Topic-credit ignored-turns
Hidden requirement:
- semantic similarity or proposition matching
- attribution of later credit
Meaning:
This is definitely not purely structural.
Run only when:
- transcript quality high
- speaker attribution high
- single-language or very well-supported same-language segments
- semantic model calibrated for that language
Otherwise:
- suppress

8) Agreement-asymmetry / position shift logic
Hidden requirement:
- stance or position movement inference
- semantic interpretation across turns
Meaning:
This is the weakest candidate for early production.
Recommendation:
- experimental lane only
- no production review-worthy events from this detector until validated

Engineering rule:
Every detector returns:
- score
- evidence grade
- reason codes
- abstain flag
- suppression reason if blocked

No detector should silently degrade from "strong evidence" to "vibes."
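
A minimal sketch of the detector return contract, assuming the A to D evidence bands from section 9. DetectorResult and its invariants are an illustration of the engineering rule above, not an existing Kashi type. The invariant that grade D and abstention always go together is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

EVIDENCE_GRADES = ("A", "B", "C", "D")


@dataclass(frozen=True)
class DetectorResult:
    detector: str
    score: Optional[float]
    evidence_grade: str
    reason_codes: tuple[str, ...] = ()
    abstain: bool = False
    suppression_reason: Optional[str] = None

    def __post_init__(self) -> None:
        if self.evidence_grade not in EVIDENCE_GRADES:
            raise ValueError(f"unknown evidence grade: {self.evidence_grade!r}")
        # Assumed invariant: D means suppressed, and suppressed means D.
        if self.abstain != (self.evidence_grade == "D"):
            raise ValueError("abstain flag and grade D must agree")
        if self.abstain and (self.score is not None or not self.suppression_reason):
            raise ValueError("suppressed result needs a reason and no score")
        if not self.abstain and self.score is None:
            raise ValueError("non-suppressed result needs a score")
```

Rejecting inconsistent results at construction time is what prevents the silent degradation described above: a detector cannot emit a score while also claiming grade D.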

==================================================
6. MIXED-LANGUAGE, CODE-SWITCHING, AND L2 HANDLING
==================================================

This cannot be an afterthought.
It must be a first-class branch in the pipeline.

Required variables
- primary spoken language
- secondary spoken language(s)
- mixed-language flag
- code-switch-likely flag
- participant-level language heterogeneity if inferable safely
- locale pack
- language-regime-specific baseline key

Recommended policy

Single-language + officially supported + good quality
- full structural lane allowed
- hybrid lane allowed only for detectors explicitly validated in that language

Single-language + officially unsupported or low-confidence language handling
- structural lane only if timestamps/speakers are solid
- hybrid lane blocked

Mixed-language or code-switch-heavy
- meeting-level structural metrics allowed only if speaker/timing quality survives
- dyadic semantic detectors blocked
- response-latency interpretation downgraded
- explicit caveat attached
- do not merge these meetings into the same baseline stream as clean single-language meetings

L2-specific caution
Do not treat slower response, lower floor time, or longer pauses as direct evidence of suppression.
These features must flow through:
- self-history baseline
- meeting-type baseline
- locale pack
- language-regime tag

Important baseline rule:
A speaker’s Japanese-only meetings and English-heavy meetings must not automatically share one undifferentiated baseline.
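
A minimal sketch of the baseline rule: the baseline key includes the language regime, so one speaker's Japanese-only and English-heavy meetings land in separate streams. The observation dict shape, the regime labels, and the min_samples placeholder are hypothetical; the median is just one reasonable summary, not a decided statistic.

```python
from collections import defaultdict
from statistics import median
from typing import Optional


def baseline_key(speaker_id: str, meeting_type: str, language_regime: str, locale_pack: str) -> tuple:
    # Language regime is part of the key, so regimes never share a baseline.
    return (speaker_id, meeting_type, language_regime, locale_pack)


def build_baselines(observations: list[dict], min_samples: int = 2) -> dict[tuple, Optional[float]]:
    """Group metric values per baseline key; None marks streams that are too sparse."""
    grouped: dict[tuple, list[float]] = defaultdict(list)
    for obs in observations:
        key = baseline_key(
            obs["speaker_id"], obs["meeting_type"],
            obs["language_regime"], obs["locale_pack"],
        )
        grouped[key].append(obs["value"])
    return {
        key: (median(values) if len(values) >= min_samples else None)
        for key, values in grouped.items()
    }
```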

==================================================
7. PLATFORM-SPECIFIC IMPLEMENTATION NOTES
==================================================

Microsoft Teams
- Treat as strongest first-class source for turn-level analysis.
- Ingest spoken-language setting if available.
- Capture whether language mismatch correction happened.
- Preserve participant identity-hidden state if surfaced.
- For translated transcription, treat original transcript as canonical; do not run detectors on translated text.
- Because language can be updated during a meeting, version the meeting language regime over time if exposed by the connector.

Zoom
- Separate cloud audio transcript support from AI Companion marketing/support pages.
- Canonical artifact is transcript/VTT + timestamps.
- Unknown speakers must remain explicit until resolved; do not auto-guess speaker identity across many meetings without strong evidence.
- Capture selected original language if changed/regenerated.
- Add stronger audio-quality and language-quality caveats because Zoom itself warns transcription quality varies.

Google Meet
- Capture transcript-start state, host/co-host control state, and whether transcription auto-start was enabled.
- Treat “take notes for me” as separate from transcript artifact. Meeting notes are not canonical detector input.
- Because Google explicitly says the notes feature supports one language at a time and not multiple languages in the same meeting, mixed-language meetings should be downgraded aggressively.
- Until speaker-attribution guarantees are validated in implementation, treat Meet as conservative for speaker-dependent directional detectors.

==================================================
8. STORAGE / SECURITY / AUDIT IMPLICATIONS OF INPUT SUBSTRATE
==================================================

The substrate layer is also where privacy failure happens.

Recommended storage split

Raw layer (isolated, short-lived)
- original transcript files
- raw turn text
- raw speaker labels
- parser outputs
- regeneration / correction artifacts
- access very restricted

Analytics layer (medium retention)
- turn metrics
- aggregated meeting metrics
- detector inputs stripped to necessity
- confidence bands

Review-worthy event layer
- bounded context window only
- trigger explanation
- evidence grade
- traceable references
- no unnecessary full-meeting exposure

Audit layer
- ingestion events
- parser version
- detector version
- suppression reasons
- who accessed what
- who changed language / speaker mappings if editable

Critical anti-retaliation rule
Private awareness actions must not create employer-visible side effects.
That includes:
- opening one’s own pattern page
- building a private draft
- creating private evidence material
- reviewing one’s own flagged events

Security logging can exist.
Business analytics visibility of those actions must not.
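
A minimal sketch of the anti-retaliation rule as an event router: every action reaches the security log, but private awareness actions never reach the business-analytics sink. The action names and the EventRouter class are hypothetical, and the in-memory lists stand in for real sinks.

```python
# Hypothetical action names for the four private awareness actions listed above.
PRIVATE_AWARENESS_ACTIONS = frozenset({
    "open_own_pattern_page",
    "build_private_draft",
    "create_private_evidence",
    "review_own_flagged_events",
})


class EventRouter:
    def __init__(self) -> None:
        self.security_log: list[dict] = []
        self.business_analytics: list[dict] = []

    def record(self, actor_id: str, action: str) -> None:
        event = {"actor_id": actor_id, "action": action}
        # Security logging can exist for every action.
        self.security_log.append(event)
        # Employer-visible analytics must never see private awareness actions.
        if action not in PRIVATE_AWARENESS_ACTIONS:
            self.business_analytics.append(event)
```

Routing at write time is the safer design choice here: filtering at read time leaves the private events sitting in a store that employer-side queries could reach.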

==================================================
9. RECOMMENDED CONFIDENCE / ABSTENTION MODEL
==================================================

Do not use one opaque composite confidence.
Use explicit bands.

Recommended evidence dimensions
- substrate_quality
- speaker_identity_quality
- text_quality
- language_regime_stability
- meeting_type_confidence
- sample_sufficiency

Recommended output bands
- A = detector-grade
- B = usable with caveat
- C = observational only
- D = suppressed / abstain

Example rule
Topic-credit ignored-turns:
- requires A/B text quality
- requires A speaker quality
- requires single-language stable regime
- requires validated semantic model for that language
If any fail -> D (suppressed)

Example rule
Speaking share:
- requires timestamp quality of B or better
- requires speaker quality of B or better
- text quality can be D
If timestamps or speaker quality fail -> D
Else A/B/C depending on sample and meeting type

UI / API requirement
Every surfaced output must carry:
- evidence band
- reason codes
- caveat flags
- abstention explanation when blocked
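
A minimal sketch of the two example rules in this section. The source does not say how "A/B/C depending on sample and meeting type" is computed; taking the weakest of the two bands is an assumption. Function and parameter names are hypothetical.

```python
BAND_ORDER = "ABCD"  # best to worst


def at_least(band: str, minimum: str) -> bool:
    return BAND_ORDER.index(band) <= BAND_ORDER.index(minimum)


def weakest(*bands: str) -> str:
    return max(bands, key=BAND_ORDER.index)


def topic_credit_eligible(
    text_quality: str,
    speaker_quality: str,
    single_language_stable: bool,
    semantic_model_validated: bool,
) -> bool:
    """False means band D (suppressed). Every prerequisite must hold."""
    return (
        at_least(text_quality, "B")
        and speaker_quality == "A"
        and single_language_stable
        and semantic_model_validated
    )


def speaking_share_band(
    timestamp_quality: str,
    speaker_quality: str,
    sample_sufficiency: str,
    meeting_type_confidence: str,
) -> str:
    """Text quality is deliberately not an input: this detector never reads text."""
    if not (at_least(timestamp_quality, "B") and at_least(speaker_quality, "B")):
        return "D"
    # Assumption: the output band is capped by the weaker of the two context bands.
    return weakest(sample_sufficiency, meeting_type_confidence)
```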

==================================================
10. WHAT SHOULD HAPPEN WHEN QUALITY IS WEAK?
==================================================

Not everything should fail the same way.

Case 1 — transcript weak, speaker strong
Allowed:
- floor time
- interruption if overlap segmentation good
- turn graph
Blocked:
- question-response
- topic credit
- agreement asymmetry

Case 2 — speaker weak, transcript strong
Allowed:
- at most, very coarse meeting-level text observations, in the experimental lane only
Blocked:
- person-linked and dyadic detectors
- manager-target concentration logic

Case 3 — mixed-language meeting
Allowed:
- coarse observational metrics with strong caveats
Blocked:
- semantic attribution detectors
- naive latency-based social interpretation

Case 4 — unsupported meeting type
Allowed:
- descriptive metrics only
Blocked:
- review-worthy events by default

Case 5 — sparse sample
Allowed:
- “not enough comparable exposure yet”
Blocked:
- person-level risk narrative

Absolute rule:
Low-confidence outputs should be suppressed, downgraded, or labeled with an explanation.
They must not be shown as if they are equally trustworthy.

==================================================
11. TECHNICAL ACCEPTANCE CRITERIA FOR INPUT SUBSTRATE
==================================================

A. Schema / ingestion
- System supports normalized ingestion for Teams, Zoom, and Meet.
- Every meeting record stores platform, transcript artifact refs, speaker refs, language regime, and quality profile.
- Parser failures are explicit states, not silent drops.

B. Quality gating
- No detector runs before substrate gating completes.
- Detector eligibility is stored per meeting and per detector.
- Mixed-language, low-diarization, and low-text-confidence states are first-class variables.

C. Detector discipline
- Structural-only detectors and hybrid detectors are separated in code and docs.
- Hybrid detectors cannot run unless semantic-lane prerequisites are met.
- Suppression reasons are machine-readable and audit-loggable.

D. Baselines
- Baselines are segmented by meeting type and language regime.
- Mixed-language and code-switch-heavy meetings do not contaminate single-language baseline streams.
- Sparse data triggers abstention rather than fabricated certainty.

E. Privacy / governance
- Raw transcripts are isolated from normal product browsing.
- Private employee awareness actions are not surfaced to employer-side analytics.
- Audit logs exist for ingestion, mapping edits, and detector execution.

F. QA / validation
- Test pack includes:
  - clean single-language healthy meeting
  - overlap-heavy low-quality audio meeting
  - low-diarization meeting
  - multilingual / code-switch meeting
  - facilitator-heavy meeting
  - incident bridge
  - standup
  - sparse-data meeting
- Each test must assert expected detector suppression / downgrade behavior, not only positive detection.

==================================================
12. RECOMMENDED BUILD ORDER
==================================================

Phase 1 — Normalize the substrate
- canonical schema
- platform parsers
- quality profile object
- platform/language support registry

Phase 2 — Build hard gates
- substrate presence gate
- parser integrity gate
- speaker-attribution gate
- transcript-text gate
- language-regime gate
- meeting-type gate
- sample sufficiency gate

Phase 3 — Release strict structural lane
- speaking share
- interruption / overlap / truncation
- turn graph
- response latency with caveats
- longitudinal drift
- evidence bands + abstention

Phase 4 — Add normalization hardening
- meeting-type priors
- locale packs
- language-regime baseline split
- confound surfaces

Phase 5 — Experimental semantic lane
- unanswered-question rate
- topic-credit ignored-turns
- agreement asymmetry, if at all, only after serious validation

==================================================
13. BRUTAL BOTTOM LINE
==================================================

From the dev perspective, the input-substrate problem is this:
Kashi is not building “meeting analytics.”
Kashi is building a detector system whose outputs are only defensible if every detector is conditional on substrate quality.

So the correct engineering posture is:
- platform-aware
- language-aware
- meeting-type-aware
- confidence-explicit
- abstention-friendly
- suppression-by-default when the substrate is weak

If the team does not build that layer, the product will overstate certainty exactly where the data is dirtiest.
That is not a minor bug.
That is the core failure mode.

==================================================
14. SOURCE BASIS USED FOR THIS MEMO
==================================================

Internal Kashi materials
- Kashi — Progress & Project Overview (2026-04-21)
- Transparency That Drives Institutional Accountability / meeting_governance_ai_concept_note
- Kashi cross-cultural / multilingual strategy memo
- Kashi Measurement-Science Research Memo

Official platform documentation checked
- Microsoft Support: View live transcription in Microsoft Teams meetings
- Microsoft Support: Record a meeting in Microsoft Teams
- Microsoft Support: Hide your identity in meeting captions and transcripts in Microsoft Teams
- Google Meet Help: Use transcripts with Google Meet
- Google Meet Help: Take notes for me in Google Meet
- Google Meet Help: Learn about Speech Translation
- Zoom Support: Using audio transcription for cloud recordings
- Zoom Support / Zoom docs: AI Companion language support and data-handling materials

Note on source handling
- This memo uses official platform docs for current platform capability claims.
- It uses internal Kashi memos for product critique, gating doctrine, and architectural implications.
- Where the official docs were weaker or less explicit than other platforms (especially around some Google Meet speaker-attribution details), the recommendation is intentionally conservative rather than assumptive.