Kashi - Baseline / Calibration Perspective
Technical research memo for developers
Date: 2026-04-21

Purpose

Turn the baseline / calibration question into concrete technical design guidance for the Kashi team.

Core question

Compare against what: own history, role baseline, meeting-type baseline, dyad baseline, subgroup baseline?
How long is the baseline window?
What resets a baseline when org structure changes?
How do locale and language conditions affect the baseline stream?

Bottom line

Baseline / calibration is not a reporting detail. It is part of the validity model.

Kashi is already directionally right to center per-speaker own-baseline calibration instead of naive team averages. But that is not enough. If Kashi stays at “speaker vs own 90-day rolling average,” the system will remain under-calibrated for real enterprise use.

The technically defensible direction is:
1) no single universal baseline;
2) baseline stack, not baseline singular;
3) comparable exposure, not raw meeting count;
4) baseline forking/resets when context changes materially;
5) locale/language-segmented streams;
6) confidence + abstention when calibration support is weak.

Brutal conclusion: Kashi should treat calibration as a first-class subsystem with its own data model, state transitions, quality gates, and test plan. If not, the product will look more certain than it actually is.

======================================================================
1. Why this matters technically
======================================================================

Kashi’s current materials already state that per-speaker baseline calibration is central and that the current system compares a person to their own 90-day rolling baseline rather than to team average. That is the right instinct because it directly reduces obvious confounds such as introversion, L2 status, and chair-role effects.
But the current docs also show why this is incomplete:
- meeting type changes the meaning of the same structural signal;
- Japanese silence norms and L2 effects already appear as named landmines;
- some detectors are not equally “purely structural” in practice;
- transcript quality, diarization quality, and mixed-language conditions directly affect what the detector is even measuring.

So the actual calibration problem is multi-axis:

    speaker x detector x meeting type x role entitlement x dyad/subgroup x locale/language x input-quality regime x time

If Kashi does not model these axes explicitly, it will silently blend incompatible interaction regimes into one fake baseline. That is exactly how a system becomes “deterministic” but still not valid.

======================================================================
2. Critical finding: calibration must be detector-specific
======================================================================

Not all detectors live on the same evidentiary substrate.

A. Mostly structural detectors
- intrusive interruption
- floor-time / share distribution
- turn-count / duration distributions
- some parts of chilling-delta

These depend mostly on timestamps, overlap, speaker attribution, and turn graph structure.

B. Structural + contextual detectors
- response latency interpretations
- chilling-delta significance
- directive concentration
- reciprocity / takeover patterns

These stay mostly structural but depend much more on meeting type, role, and locale.

C. Hybrid / semantically dependent detectors
- unanswered-question rate if “substantive response” is required
- topic-credit ignored-turns if similarity or semantic recovery is used
- agreement-asymmetry if “position shift” is inferred rather than explicitly tagged

These are not calibrated the same way as raw overlap or airtime. Their confidence surface is wider and more fragile.
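
One way to keep the three families from being treated interchangeably is to make the family an explicit attribute of every detector. The Python sketch below is illustrative only: the detector ids, the `DETECTOR_FAMILY` table, and the `REQUIRED_CONTEXT` mapping are assumptions derived from the dependencies described above, not existing Kashi code.

```python
from enum import Enum
from typing import Dict, FrozenSet, Set


class DetectorFamily(Enum):
    STRUCTURAL = "A"             # timestamps, overlap, attribution, turn graph
    STRUCTURAL_CONTEXTUAL = "B"  # structural, but meaning depends on context
    HYBRID_SEMANTIC = "C"        # depends on transcript content


# Hypothetical detector ids, named after the detectors listed above.
DETECTOR_FAMILY: Dict[str, DetectorFamily] = {
    "intrusive_interruption": DetectorFamily.STRUCTURAL,
    "floor_time_share": DetectorFamily.STRUCTURAL,
    "turn_distribution": DetectorFamily.STRUCTURAL,
    "response_latency": DetectorFamily.STRUCTURAL_CONTEXTUAL,
    "chilling_delta_significance": DetectorFamily.STRUCTURAL_CONTEXTUAL,
    "directive_concentration": DetectorFamily.STRUCTURAL_CONTEXTUAL,
    "reciprocity_takeover": DetectorFamily.STRUCTURAL_CONTEXTUAL,
    "unanswered_question_rate": DetectorFamily.HYBRID_SEMANTIC,
    "topic_credit_ignored_turns": DetectorFamily.HYBRID_SEMANTIC,
    "agreement_asymmetry": DetectorFamily.HYBRID_SEMANTIC,
}

_STRUCTURAL = frozenset({"diarization_confidence", "overlap_quality"})
_CONTEXTUAL = _STRUCTURAL | {"meeting_type", "role_schema", "locale_pack"}

# Assumed mapping: context each family needs before its output is interpreted.
REQUIRED_CONTEXT: Dict[DetectorFamily, FrozenSet[str]] = {
    DetectorFamily.STRUCTURAL: _STRUCTURAL,
    DetectorFamily.STRUCTURAL_CONTEXTUAL: _CONTEXTUAL,
    DetectorFamily.HYBRID_SEMANTIC: _CONTEXTUAL | {"transcript_confidence"},
}


def missing_context(detector_id: str, available: Set[str]) -> Set[str]:
    """Return the context fields this detector still needs."""
    family = DETECTOR_FAMILY[detector_id]
    return set(REQUIRED_CONTEXT[family]) - set(available)
```

A non-empty result from `missing_context` is a natural trigger for a detector-level abstention reason rather than a silent default.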

Implication: Kashi should not have one single “confidence” or “baseline sufficiency” rule for all detectors. Each detector family needs:
- its own minimum support requirements,
- its own quality gates,
- its own abstention rules,
- and its own baseline hierarchy.

======================================================================
3. Compare against what: recommended baseline hierarchy
======================================================================

The right answer is not “team average” and not even just “own history.” It is a baseline stack.

3.1 Own-history baseline

Use for:
- detecting change relative to the same speaker’s prior behavior or treatment.

Good for:
- introversion,
- natural talkativeness,
- personal style,
- stable L2 patterns,
- recurring dyadic patterns.

Weakness:
- it can normalize chronic dysfunction if the person has been suppressed for a long time;
- it becomes misleading when the meeting regime changes.

Rule: own-history stays the primary anchor, but never the only anchor.

3.2 Meeting-type baseline

Use for:
- interpreting the same metric differently across standup vs design critique vs 1:1 vs incident bridge vs training vs exec review.

This should be mandatory. A high floor-time Gini in training is not the same thing as a high floor-time Gini in a weekly sync. A fast redirect in an incident bridge is not the same thing as a redirect in a brainstorm.

Rule: No cross-type pooling for risk interpretation. If meeting_type is unknown or low-confidence, compute observational metrics only; do not create review-worthy events by default.

3.3 Role-entitlement baseline

Use for:
- chair / facilitator,
- presenter,
- incident commander,
- trainer,
- attendee,
- interviewer,
- account lead,
- decision-maker vs observer.

Reason: some roles are structurally entitled to more turns, more redirects, or more questioning. Without role-aware baselines, Kashi will repeatedly “discover” legitimate facilitation.

Rule: Compare speaker vs own-history within role entitlement, not just within meeting type.

3.4 Dyad baseline

Use for:
- directional asymmetry,
- repeated targeting,
- speaker A’s treatment of speaker B relative to A’s treatment of others,
- whether B’s post-event chilling is concentrated after A.

This is especially important because the product is about repeated asymmetry, not just loud meetings.

Rule: dyad baselines should be first-class for interruption concentration, unanswered-question concentration, topic-credit capture, and post-turn chilling patterns.

3.5 Within-meeting peer baseline

Use for:
- selective treatment inside the current meeting;
- example: one speaker is interrupted 8 times while peers are interrupted 0–1 times.

This is the fastest local sanity check, but not enough by itself for longitudinal claims.

Rule: within-meeting peer comparison can support event construction, but person-level narratives still require repeated comparable exposure.

3.6 Subgroup baseline

Use carefully for:
- operational subgroup comparison, such as seniority band, function, language status, or recurring internal subgroup.

Use cases:
- are juniors consistently getting shorter answer windows than seniors?
- are non-native speakers receiving longer unanswered-question runs than native speakers?
- is one office or cross-functional subgroup systematically treated differently?

Warning: Do not casually expose protected-class comparisons or sensitive subgroup analytics in buyer-facing views. This must stay governance-bound and ethics-reviewed.

Rule: subgroup baselines are useful for internal validity analysis and governance-layer review, but should be suppressed or heavily thresholded in ordinary operational views.

3.7 Locale/language baseline

Mandatory. A speaker’s Japanese-only meetings and English-heavy meetings are not one behavioral stream. Likewise, mixed Japanese-English, Mandarin-English, or Cantonese-English meetings are not equivalent to clean monolingual sessions.

Rule: same speaker, different language regime = separate baseline stream or at least tagged substream.

======================================================================
4. Recommended baseline model
======================================================================

Do not store a single rolling average. Store a baseline registry keyed by context.

Suggested key shape:

    baseline_key = {
        tenant_id,
        subject_scope,          # speaker / dyad / subgroup / team
        detector_id,
        meeting_type,
        role_schema,
        locale_pack,
        language_pack,
        platform_family,
        internal_vs_external,
        recurrence_type
    }

Notes:
- not every field is required for every detector;
- some keys can fall back hierarchically if sample support is weak;
- platform_family matters because upstream transcript quality differs by platform and feature;
- language_pack should distinguish at least monolingual vs mixed-language regime.

Suggested baseline hierarchy resolution:
1. exact match stream
2. same detector + same meeting type + same role + same language pack
3. same detector + same meeting type + same locale pack
4. same detector + same role within same locale pack
5. speaker global stream (observational only; not strong enough for risk interpretation)
6. abstain

Important: Fallback should degrade confidence sharply. It should not preserve the illusion that a weak fallback is equivalent to a strong exact-match baseline.

======================================================================
5. How long should the baseline window be?
======================================================================

There should not be one window.

Recommended multi-window design:

5.1 Short recency window: 30 days

Use for:
- recent change detection,
- acute drift,
- current pattern surfacing,
- monitoring after an event cluster.

5.2 Primary working baseline: 90 days

Use for:
- main personal calibration,
- dyadic continuity,
- review-worthy event scoring,
- current product narrative consistency.

This aligns with Kashi’s existing deck and is a reasonable primary anchor.

5.3 Longer historical context: 180 days

Use for:
- drift detection,
- whether a recent change is actually meaningful,
- distinguishing long-term chronic pattern vs temporary fluctuation,
- explaining resets or regime shifts.

5.4 Event-local micro-baseline

Use for detectors like chilling-delta:
- pre-event local window,
- post-event local window,
- same-speaker expected short-term contribution pattern.

This should not be confused with the 90-day person baseline.

5.5 Comparable-exposure rule

Meeting count alone is not sufficient. Five meetings with almost no speaking opportunity are weaker than three meetings with dense comparable interaction.

Recommended support checks should include:
- number of comparable meetings,
- number of meaningful turns,
- number of dyadic opportunities,
- number of detector-eligible moments,
- spread across time,
- quality-weighted exposure.

Recommended product rule: No person-level interpretation unless both:
- minimum comparable meetings threshold is met, and
- minimum interaction-opportunity threshold is met.

Cold start: Current “skip if <5 meetings” is directionally fine for some detectors, but too blunt as a universal rule. Support should be detector-specific and exposure-based.

======================================================================
6. What should reset or fork a baseline?
======================================================================

The right question is not always “reset or not.” Often the correct move is baseline forking.

Fork = preserve old baseline history, but start a new active stream.
Reset = discard the active baseline for scoring purposes and rebuild.
Decay = retain old observations but sharply downweight them.
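
The three operations can be sketched as explicit state transitions on a stream object. This is a minimal Python sketch under stated assumptions: the `BaselineStream` class, the `archived` field, and the exponential half-life weighting are illustrative choices, not the existing Kashi implementation. The status values reuse the `baseline_status` vocabulary suggested later in this memo.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class BaselineStatus(Enum):
    ACTIVE = "active"
    FORKED = "forked"          # frozen; kept for historical explanation
    REBUILDING = "rebuilding"  # reset; not yet usable for scoring


@dataclass
class BaselineStream:
    stream_id: str
    status: BaselineStatus = BaselineStatus.ACTIVE
    # (age_days, value) pairs; a real store would keep timestamps instead.
    observations: List[Tuple[float, float]] = field(default_factory=list)
    archived: List[Tuple[float, float]] = field(default_factory=list)
    parent_id: Optional[str] = None
    last_fork_reason: Optional[str] = None
    last_reset_reason: Optional[str] = None


def fork_stream(old: BaselineStream, new_id: str, reason: str) -> BaselineStream:
    """Fork: freeze the old stream with its history, open a new active one."""
    old.status = BaselineStatus.FORKED
    return BaselineStream(stream_id=new_id, parent_id=old.stream_id,
                          last_fork_reason=reason)


def reset_stream(stream: BaselineStream, reason: str) -> None:
    """Reset: stop scoring against current observations and rebuild."""
    stream.archived.extend(stream.observations)  # kept for audit only (assumption)
    stream.observations.clear()
    stream.status = BaselineStatus.REBUILDING
    stream.last_reset_reason = reason


def decayed_mean(stream: BaselineStream, half_life_days: float) -> Optional[float]:
    """Decay: keep every observation, weighted by 0.5 ** (age / half_life)."""
    if not stream.observations:
        return None
    weights = [0.5 ** (age / half_life_days) for age, _ in stream.observations]
    total = sum(w * v for w, (_, v) in zip(weights, stream.observations))
    return total / sum(weights)
```

Keeping fork and reset as separate functions with a mandatory `reason` argument means the audit trail can always say which of the three moves was applied and why.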

Recommended triggers:

6.1 Team / manager structure change

Trigger examples:
- manager_id changes,
- team_id changes,
- reporting line changes,
- recurring roster changes beyond threshold,
- one or more central meeting actors disappear or appear.

Why it matters: interaction norms are partly social, not just personal.

Recommended action: fork the stream. Keep old stream for historical explanation, but do not continue scoring new meetings against it as if nothing changed.

6.2 Role change

Trigger examples:
- attendee becomes facilitator,
- IC role assigned,
- presenter vs non-presenter mode,
- trainer assignment,
- promotion into people-manager role.

Recommended action: fork role-aware baseline stream.

6.3 Meeting-type regime change

Trigger examples:
- weekly sync becomes escalation review,
- project review becomes client-facing review,
- standup converted to incident mode.

Recommended action: do not treat as same stream. Either fork or route to different baseline family immediately.

6.4 Language / locale change

Trigger examples:
- the speaker shifts from Japanese-heavy to English-heavy meetings,
- mixed-language rate rises materially,
- office / country changes,
- meeting population shifts from domestic to regional.

Recommended action: start separate language/locale stream. Do not blend.

6.5 Platform / transcript substrate change

Trigger examples:
- Zoom -> Teams,
- caption engine version change,
- human transcript vs ASR transcript,
- diarization provider change.

Reason: baseline changes may partly reflect substrate change rather than human behavior.

Recommended action: platform tag becomes part of the confidence model. If the substrate change is large, fork or at least start a caution band until stability is re-established.

6.6 Long inactivity gap

Trigger examples:
- no comparable meetings for 60–90+ days,
- leave of absence,
- project rollover.

Recommended action: decay aggressively or soft-reset.
Do not pretend an old baseline is still current after a long inactive gap.

6.7 Step-change detection

Trigger examples:
- abrupt persistent shift in speaking share,
- abrupt persistent shift in response latency,
- abrupt persistent shift in interruption load after stable history.

Recommended action: open “possible regime change” state. Do not instantly overwrite the baseline. Hold both old and emerging baseline candidates until enough evidence accumulates.

======================================================================
7. Locale and language conditions must segment the stream
======================================================================

This is not optional. The cross-cultural memo is already pointing the right way: spoken language cannot be treated as a metadata footnote.

What should be recorded per meeting:
- source platform,
- source feature used,
- primary spoken language,
- secondary spoken languages,
- monolingual vs mixed-language flag,
- likely code-switching flag,
- locale pack,
- transcript confidence,
- diarization confidence,
- overlap quality,
- whether external participants are present.

Critical rule: A speaker’s Japanese-only meetings and English-heavy meetings must not be pooled into one undifferentiated baseline stream.

Why:
- pause tolerance differs,
- overlap norms differ,
- L2 load affects latency and participation,
- code-switching affects ASR reliability,
- even “silence” changes meaning by interaction regime.

Recommended product behavior:
- confidence down-rank or suppress in mixed-language / code-switching heavy sessions;
- show caveat labels when language regime materially affects interpretation;
- keep language/locale-specific baseline tags in storage and scoring.

======================================================================
8. Quality gates are calibration gates
======================================================================

Calibration is only as good as the input substrate.

Needed first-class gates:
- transcript-confidence gating,
- speaker-diarization-confidence gating,
- overlap-quality flag,
- mixed-language / code-switching flag,
- platform support matrix,
- detector-specific support matrix.

Technical implication:
- If diarization is weak, directional interruption metrics become fragile.
- If transcript confidence is weak, semantic or hybrid detectors become fragile.
- If overlap quality is weak, truncation detection becomes fragile.
- If code-switching is heavy, latency and unanswered-question logic may be contaminated by ASR failure.

Recommended output policy:
- high quality -> scoreable
- medium quality -> scoreable but confidence-downgraded
- low quality -> observational only
- unsupported -> suppress risk interpretation entirely

Do not quietly emit the same kind of event object across all four states.

======================================================================
9. Confidence, evidence grade, and abstention
======================================================================

Kashi should stop acting as if all scored meetings are equally interpretable.

Recommended confidence dimensions:
- detector confidence,
- calibration support confidence,
- input quality confidence,
- meeting-type confidence,
- role-schema confidence,
- language/locale confidence.

Example composite explanation:

    confidence = function(
        detector_support,
        comparable_exposure,
        transcript_quality,
        diarization_quality,
        meeting_type_confidence,
        baseline_match_quality
    )

Important: Do not flatten this to a magic number without preserving reason codes.

Recommended user-visible reason codes:
- enough comparable meetings
- low comparable exposure
- meeting type unsupported
- role metadata incomplete
- mixed-language uncertainty
- diarization weak
- observational only

Abstention rules should exist at three levels:
1. detector abstention
2. event-construction abstention
3. escalation abstention

Example: You may still show “Kenji interrupted Aiko 9 times” as an observation. But you should abstain from constructing a review-worthy directional-risk event if:
- meeting type is unsupported,
- role tagging is weak,
- diarization is weak,
- comparable exposure is too low.

That is not a product weakness. That is what seriousness looks like.

======================================================================
10. Recommended technical architecture
======================================================================

10.1 Meeting-context extraction layer

Before scoring, derive:
- meeting_type
- meeting_type_confidence
- internal_vs_external
- recurrence_type
- decision_mode
- role schema per participant
- locale_pack
- language_pack
- mixed_language_flag
- platform_family
- transcript_confidence
- diarization_confidence
- overlap_quality

10.2 Calibration registry service

Responsibilities:
- store baseline streams,
- update recency-weighted statistics,
- detect support sufficiency,
- detect regime-change triggers,
- manage forks/resets,
- expose comparable-exposure summaries.

10.3 Detector router

For each detector:
- choose eligible baseline family,
- attach detector-specific support checks,
- attach detector-specific fallback rules.

10.4 Risk interpreter

Only this layer may create:
- review-worthy events,
- trend narratives,
- escalation-ready summaries.

It should consume:
- raw metric output,
- baseline comparisons,
- quality gates,
- confidence dimensions,
- abstention rules.

10.5 Audit trail

Every interpreted output should retain:
- baseline stream used,
- fallback level used,
- confidence reasons,
- detector version,
- context tags,
- whether any reset/fork logic was active.

Without this, a calibration decision cannot be contested after the fact.

======================================================================
11. Suggested data model additions
======================================================================

A. meetings table additions
- meeting_type
- meeting_type_confidence
- internal_vs_external
- recurrence_type
- decision_mode
- platform_family
- transcript_source_feature
- transcript_confidence
- diarization_confidence
- overlap_quality
- primary_language
- secondary_languages
- mixed_language_flag
- locale_pack
- calibrated_status   # scoreable / downgraded / observational_only / suppressed

B. participant_context table
- speaker_id
- meeting_id
- role_schema
- role_confidence
- presenter_flag
- facilitator_flag
- incident_commander_flag
- trainer_flag
- external_participant_flag
- l2_self_marked_or_inferred_flag (careful governance)

C. baseline_stream table
- baseline_stream_id
- detector_id
- subject_scope_type   # speaker / dyad / subgroup / team
- subject_scope_id
- context_key_hash
- baseline_status      # active / forked / deprecated / rebuilding
- support_meeting_count
- support_turn_count
- support_exposure_count
- first_seen_at
- last_seen_at
- last_reset_reason
- last_fork_reason

D. baseline_snapshot table
- stream_id
- window_30_stats
- window_90_stats
- window_180_stats
- quality_weighted_stats
- drift_flags
- regime_change_flags

E. event_interpretation table
- event_id
- baseline_stream_id
- baseline_match_level
- detector_confidence
- calibration_confidence
- input_quality_confidence
- abstention_reason_codes
- event_state          # observational / review_candidate / suppressed

======================================================================
12. Algorithm sketch
======================================================================

    For each meeting M:
        derive_context(M)
        quality_state = assess_input_quality(M)
        support_state = determine_platform_language_support(M)

        for each detector D:
            raw = run_detector(D, M)
            stream = resolve_best_baseline_stream(
                detector=D,
                context=M.context,
                subject_scope=D.subject_scope
            )

            # Check for regime change before assessing support, so that
            # support is measured on the stream actually used for scoring.
            regime = assess_regime_change(stream, M.context)
            if regime.fork_required:
                stream = fork_stream(stream, reason=regime.reason)

            support = assess_comparable_exposure(stream, D)

            interpretation = interpret(
                raw_metrics=raw,
                baseline_stream=stream,
                support=support,
                quality_state=quality_state,
                support_state=support_state,
                context=M.context,
                detector=D
            )

            if interpretation.abstain:
                emit_observational_metrics_only(raw)
            else:
                emit_interpreted_output(interpretation)

======================================================================
13. Critical risks if this is done badly
======================================================================

1. Baseline contamination
   Japanese meetings, English-heavy meetings, incident bridges, and training sessions get blended into one stream.

2. Chronic-harm normalization
   Own-history baseline treats long-term suppression as “normal self.”

3. Sparse-data overclaiming
   Five mostly passive meetings become a fake person-level narrative.

4. Silent regime change
   Team or manager change happens but scoring keeps using old baseline.

5. Substrate drift masquerading as behavior drift
   ASR/diarization/platform change looks like human change.

6. Manager gaming
   Visible metrics improve while pressure shifts into sequencing, agenda control, 1:1s, or offstage channels.

7. Calibration opacity
   A user cannot tell which baseline they were compared against or why.

8. Premature subgroup analytics
   Sensitive group comparison is surfaced without enough governance or sample support.

======================================================================
14. Recommended MVP sequencing
======================================================================

Do not try to perfect everything at once.

Phase A - Must have before serious pilot
- Keep own-baseline calibration.
- Add meeting_type and meeting_type_confidence.
- Add role schema capture for key roles.
- Add language/locale tags.
- Add transcript/diarization confidence gating.
- Add observational-only fallback.
- Add per-detector abstention reasons.
- Add baseline fork triggers for team/manager/language shifts.

Phase B - Strong next layer
- Add exact baseline registry with hierarchical fallback.
- Add dyad baselines for repeated directional patterns.
- Add comparable-exposure scoring.
- Add 30/90/180 multi-window snapshots.
- Add regime-change detection.

Phase C - Later hardening
- Add subgroup baselines under governance controls.
- Add platform-family-specific reliability priors.
- Add adaptive decay / state-space or Bayesian updates if needed.
- Add richer cross-locale packs.
- Add post-deployment displacement / gaming monitoring.

======================================================================
15. Acceptance criteria for dev planning
======================================================================

A. Baseline identity
[ ] A scored output can show exactly which baseline family was used.
[ ] Cross-type pooling is blocked for risk interpretation.
[ ] Same-speaker Japanese-only and English-heavy meetings do not pool by default.

B. Quality gating
[ ] Low transcript confidence downgrades or suppresses interpretation.
[ ] Low diarization confidence downgrades or suppresses directional detectors.
[ ] Mixed-language or code-switch-heavy meetings can be routed to observational-only mode.

C. Support sufficiency
[ ] “<5 meetings” is not the only gate; comparable exposure is also checked.
[ ] A person-level output cannot be created from sparse passive attendance alone.
[ ] Unsupported meeting types do not create review-worthy events.

D. Reset/fork behavior
[ ] Team/manager/role/language/platform changes can trigger baseline forking.
[ ] Fork reason is logged and auditable.
[ ] Old and new streams are distinguishable in storage and UI.

E. Contestability
[ ] Users/reviewers can inspect baseline match level and confidence reasons.
[ ] Event interpretation retains versioned reason codes.
[ ] Calibration decisions are auditable after the fact.

F. Evaluation
[ ] For each supported meeting type, there are healthy, borderline, harmful, and confounded cases.
[ ] Confounded cases include introversion, facilitator role, L2, mixed-language, and rough-but-benign disagreement.
[ ] Platform/language quality matrix is part of the test plan.

======================================================================
16. Final recommendation
======================================================================

For Kashi, the clean answer to “compare against what?” is: Not one thing. Compare against a baseline stack:
- own history,
- own history within meeting type,
- own history within role entitlement,
- dyad baseline,
- within-meeting peer context,
- and locale/language-segmented streams.

The clean answer to “how long is the window?” is: not one window; use 30/90/180-day layers plus detector-local micro-baselines.

The clean answer to “what resets a baseline?” is: material regime change should usually fork the stream, not silently reuse it.

The clean answer to “how do locale and language conditions affect the baseline stream?” is: they segment it. Japanese-heavy and English-heavy meetings should not be blended into one fake behavioral history.

Most important: Kashi should promote calibration from a scoring detail to an explicit subsystem. That is one of the clearest moves available for making the product look methodologically serious instead of just rhetorically careful.
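
To make the baseline stack concrete for developers, the fallback order from section 4 can be sketched as an ordered lookup that returns both the matched stream and how far it had to fall back. This is a minimal Python sketch, not a definitive implementation: the registry layout, the `Resolution` type, the choice of fields per level, and the confidence multipliers are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

# (level, context fields that must match, confidence multiplier).
# Multipliers are placeholder values; the memo only requires that
# fallback degrades confidence sharply.
FALLBACK_LEVELS: List[Tuple[int, Tuple[str, ...], float]] = [
    (1, ("meeting_type", "role_schema", "locale_pack", "language_pack",
         "platform_family", "internal_vs_external", "recurrence_type"), 1.0),
    (2, ("meeting_type", "role_schema", "language_pack"), 0.7),
    (3, ("meeting_type", "locale_pack"), 0.5),
    (4, ("role_schema", "locale_pack"), 0.3),
    (5, (), 0.1),  # speaker global stream: observational only
]

Key = Tuple[str, str, Tuple[Tuple[str, str], ...]]


class Resolution(NamedTuple):
    stream_id: str
    level: int
    confidence_multiplier: float
    risk_eligible: bool


def make_key(detector_id: str, subject_id: str,
             context: Dict[str, str], fields: Tuple[str, ...]) -> Key:
    """Build a registry key from the context fields used at one level."""
    return (detector_id, subject_id, tuple((f, context[f]) for f in fields))


def resolve(registry: Dict[Key, str], detector_id: str, subject_id: str,
            context: Dict[str, str]) -> Optional[Resolution]:
    """Walk levels 1-5 in order; returning None corresponds to level 6 (abstain)."""
    for level, fields, multiplier in FALLBACK_LEVELS:
        if any(context.get(f) is None for f in fields):
            continue  # cannot match a level whose context is unknown
        stream_id = registry.get(make_key(detector_id, subject_id, context, fields))
        if stream_id is not None:
            return Resolution(stream_id, level, multiplier, risk_eligible=level < 5)
    return None
```

Returning the level alongside the stream is the design point: the audit trail and the confidence model can both record which rung of the stack was used, instead of treating every match as equivalent.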

======================================================================
Source notes
======================================================================

Internal Kashi materials used
- Kashi - Progress & Project Overview (2026-04-21)
- Kashi Measurement-Science Research Memo
- Kashi Meeting-Type Normalization Research Memo
- Kashi Cross-Cultural / Multilingual Strategy Memo
- Kashi Research Synthesis: Legal Defensibility, Procedural Fairness, and Governance Design
- Kashi Retaliation-Risk Research Memo

Key external / official sources checked
- European Commission, Navigating the AI Act FAQ
- NIST AI RMF 1.0
- NIST AI RMF Playbook (Map / Manage)
- Google Meet Help: meeting transcript language support
- Microsoft Support: multilingual speech recognition in Teams
- Zoom Support: supported languages for AI Companion features / meeting questions

Short external takeaways used here
- high-risk / workplace AI posture remains context-and-use dependent, not marketing-label dependent
- AI risk management guidance favors context analysis, post-deployment monitoring, feedback/appeal mechanisms, and go/no-go discipline
- upstream transcript/language support is uneven across platforms, so platform/language confidence must be part of Kashi’s calibration layer

End of memo