Kashi — Longitudinal Aggregation Perspective
Technical research memo for developers
Date: 2026-04-21

Purpose
Turn the “longitudinal aggregation” question into a concrete engineering and measurement design for Kashi.

Scope
This memo focuses on the aggregation layer that sits above meeting-level detector output. It is written for implementation planning, not pitch copy.

Bottom line
Kashi only becomes meaningfully “Kashi” at the longitudinal layer, but this is also where the system can most easily become statistically sloppy or politically dangerous.

The aggregation layer should not be “sum scores across meetings and draw a line chart.” It should be a confidence-aware evidence accumulation system that:
1) aggregates separately by unit (person, dyad, team, subgroup),
2) only compares like with like,
3) weights by exposure and input quality,
4) caps the influence of any single meeting,
5) uses recency-aware drift detection rather than naïve averaging,
6) abstains when comparable exposure is too weak.

Internal anchor
Kashi’s current materials already establish the core shape: longitudinal aggregation over 30 / 90 / 180-day windows, at least per person / per dyad / per team, calibrated to each speaker’s own baseline rather than team average. They also already say a single meeting is noise, a 90-day pattern is signal, and at least one detector already has a cold-start skip rule for fewer than 5 meetings.

The measurement-science and meeting-type memos then tighten this further: person-level interpretation requires repeated comparable exposure, meeting count alone is insufficient, meeting-type normalization is part of validity, and outputs should carry evidence grade and abstention when evidence is weak.

1. What the longitudinal layer is actually for
The aggregation layer has five jobs:

A. Convert meeting-level observations into pattern-level evidence.
Meeting-level detectors tell you that something happened in one session.
Aggregation tells you whether the same thing keeps happening to the same person, from the same counterpart, under comparable conditions, often enough that it becomes review-worthy.

B. Separate repeated exposure from random noise.
One rough meeting should not produce a person-level risk narrative. The layer should ask:
- Did this pattern recur?
- In comparable meeting types?
- With enough interaction opportunity?
- With enough input quality?
- In the same directional relationship?

C. Detect drift, not only average state.
A person may look “normal on average” while showing a sharp 45-day deterioration. Longitudinal logic must detect both:
- chronic asymmetry (persistently bad)
- trajectory change (getting worse)

D. Prevent overclaiming.
The layer must be where abstention happens. If the system lacks enough comparable exposure, it should compute observations but stop before constructing review-worthy pattern objects.

E. Preserve contestability.
Aggregation must remain decomposable. Every trend must be traceable back to:
- which meetings entered the aggregate
- which meetings were excluded and why
- how each meeting was weighted
- what the confidence or evidence grade was

2. Recommended unit-of-accumulation model
Do not choose one unit. Use a stack.

2.1 Person-level stream
Question answered: “How is this person’s treatment changing over time?”
Use for:
- employee private view
- self-baseline drift
- speaking-share change
- chilling or unanswered-question burden over time
Do not use alone for:
- accusing a specific counterpart
- inferring cause without dyadic support

2.2 Dyad-level stream
Question answered: “Does A treat B differently over time?”
This is the most important unit for directional asymmetry.
Use for:
- interruption directionality
- unanswered-question burden from one counterpart
- repeated takeover / credit capture involving the same pair
- manager -> employee asymmetry patterns
The dyad stream is often more probative than the person stream because it tests direction, not just burden.

2.3 Team-level stream
Question answered: “Is this meeting environment structurally distorting participation?”
Use for:
- floor-time inequality
- subgroup participation compression
- team-level dominance structure
- whether the problem is broader than one dyad
Do not let team-level aggregates wash out targeted harm. A team can look broadly fine while one person is repeatedly suppressed.

2.4 Subgroup-level stream
Question answered: “Is a class of participants getting systematically worse interaction access?”
Examples:
- juniors vs seniors
- L2 speakers vs native/near-native majority language speakers
- functional subgroup
- externally tagged demographic or protected-category proxies ONLY if legally and ethically validated later; probably not MVP
This should be aggregate-only and heavily privacy-constrained.

2.5 Event-family stream
Question answered: “Is this detector family recurrent enough to matter?”
Example:
- interruption family
- chilling family
- ignored-turn family
Useful because some detectors are sparse. Aggregating at event-family level can increase stability without flattening everything into one fake composite too early.

Recommended doctrine:
Kashi should aggregate at person, dyad, team, and event-family levels by default. Subgroup aggregation should exist only where privacy thresholds and governance conditions are met.

3. The core mistake to avoid
Bad design:
Take meeting-level detector scores, average them over 30 / 90 / 180 days, and show the result.
Why this is bad:
- It treats all meetings as comparable.
- It gives a one-off extreme meeting too much power.
- It ignores interaction opportunity.
- It ignores sparse data.
- It ignores input quality.
- It hides whether the pattern is chronic or just recent.
- It looks mathematically clean while being epistemically fake.
Kashi needs accumulation, not mere averaging.

4. Recommended aggregation doctrine

4.1 Comparability gate before aggregation
A meeting may enter a person/dyad trend only if it passes comparability checks.
Minimum comparability fields:
- meeting_type
- meeting_type_confidence
- role schema / role entitlement
- internal vs external
- language regime / multilingual flag
- transcript quality
- diarization quality
- interaction opportunity level
Hard rule:
Cross-type pooling should be prohibited for risk interpretation. Weekly sync, standup, 1:1, client call, incident bridge, and training session should not enter the same inferential stream as though they were interchangeable.
If meeting_type is unknown or low-confidence:
- allow observational metrics
- block review-worthy pattern construction by default

4.2 Exposure gating
Meeting count is not enough. The real denominator is comparable exposure.
Exposure should include:
- number of comparable meetings
- total comparable minutes
- number of turns involving the person
- number of turns involving the dyad
- number of detector-relevant opportunities
Examples:
- interruption continuity needs enough overlapping turn opportunities
- unanswered-question burden needs enough actual questions
- topic-credit patterns need enough proposal opportunities
Recommended exposure fields:
- exposure_meetings_30d
- exposure_meetings_90d
- exposure_minutes_90d
- exposure_turns_person_90d
- exposure_turns_dyad_90d
- exposure_detector_opportunities_90d

4.3 Recency-aware accumulation
Use two parallel mechanisms, not one.
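A minimal sketch of running both in parallel, assuming each comparable meeting has already been reduced to one normalized detector value: a bounded window summary on one side, and an exponentially weighted moving average (EWMA) plus a one-sided cumulative-sum (CUSUM) monitor on the other. The Observation type, the use of a median as the window summary, the smoothing factor lam, and the CUSUM target and slack are illustrative assumptions, not Kashi settings.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import median


@dataclass(frozen=True)
class Observation:
    day: date      # meeting date
    value: float   # normalized detector value for one comparable meeting


def window_summary(obs, today, days):
    """Bounded summary: median over the last `days` days, or None if the window is empty."""
    cutoff = today - timedelta(days=days)
    values = [o.value for o in obs if cutoff < o.day <= today]
    return median(values) if values else None


def ewma(values, lam=0.3):
    """Recency-weighted level: z_t = lam * x_t + (1 - lam) * z_{t-1}, seeded with the first value."""
    z = None
    for x in values:
        z = x if z is None else lam * x + (1 - lam) * z
    return z


def cusum_upper(values, target, slack):
    """One-sided upper CUSUM: S_t = max(0, S_{t-1} + x_t - target - slack)."""
    s = 0.0
    for x in values:
        s = max(0.0, s + x - target - slack)
    return s


# Toy stream (made-up values): flat history followed by a recent rise.
today = date(2026, 4, 21)
obs = [Observation(today - timedelta(days=d), v)
       for d, v in [(120, 0.10), (80, 0.10), (50, 0.12), (20, 0.20), (5, 0.24)]]
ordered = [o.value for o in sorted(obs, key=lambda o: o.day)]  # oldest first

summary = {d: window_summary(obs, today, d) for d in (30, 90, 180)}
drift = ewma(ordered)
shift = cusum_upper(ordered, target=0.10, slack=0.02)
```

In this toy stream the 180-day median stays near the old level while the 30-day median and the CUSUM statistic pick up the recent rise, which is the practical reason for keeping both mechanisms.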
Mechanism A: windowed summaries
Compute bounded summaries for:
- last 30 days
- last 90 days
- last 180 days
Purpose:
- 30d = recent operational visibility
- 90d = main review-support window
- 180d = historical persistence / recovery check

Mechanism B: online drift statistics
Use a recency-weighted stream to detect gradual change.
Recommended methods:
- EWMA (exponentially weighted moving average) for smooth drift detection
- optional CUSUM (cumulative sum) for small persistent shifts
Why:
EWMA gives higher weight to recent observations and is well suited to small gradual drift. CUSUM is good for detecting smaller shifts that do not exceed one-meeting thresholds but accumulate over time.
Practical recommendation:
Use windowed summaries for UI and reporting. Use EWMA/CUSUM-style internal monitors for “pattern emerging” logic.

4.4 Robustness against one weird meeting
This is the make-or-break requirement. Do not let one meeting dominate the trend.
Recommended controls:

A. Per-meeting influence cap
Cap the maximum contribution any single meeting can make to a 90-day aggregate.
Example: allow at most 20% of the total weighted evidence in a 90-day stream to come from one meeting, or winsorize detector-specific z-scores at a fixed bound.

B. Meeting-size / opportunity normalization
A 3-minute exchange and a 90-minute workshop should not have equal influence. Weight by validated opportunity, not only raw event count.

C. Outlier flagging, not silent smoothing
If one meeting is statistically extreme:
- flag it as outlier/high-severity
- keep it visible in evidence
- do not let it fully rewrite the trend

D. Separate “severe one-off” from “persistent pattern”
A severe one-off can still matter operationally, but it should not be mislabeled as longitudinal persistence.
Keep distinct fields:
- severe_single_meeting_flag
- persistence_score
- drift_score

E. Shrinkage toward conservative prior in sparse data
Early streams should be pulled toward “uncertain / weak evidence,” not toward dramatic conclusions.
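A simple way to get this behaviour is pseudo-count shrinkage: treat the prior as a fixed number of imaginary meetings at a neutral value, so that a stream with little data stays close to neutral and a well-observed stream is barely moved. The sketch below is not empirical Bayes proper; the function name, the neutral prior value, and the prior strength of 5 are assumptions chosen for illustration.

```python
def shrink_toward_prior(stream_mean, n_meetings, prior_mean, prior_strength=5.0):
    """Blend the observed stream mean with a neutral prior.

    The prior acts like `prior_strength` extra meetings observed at `prior_mean`
    (both values are hypothetical and would need calibration).
    """
    if n_meetings < 0 or prior_strength <= 0:
        raise ValueError("n_meetings must be >= 0 and prior_strength must be > 0")
    weight = n_meetings / (n_meetings + prior_strength)  # 0 with no data, approaches 1 with many meetings
    return weight * stream_mean + (1 - weight) * prior_mean


# Same raw mean, different amounts of comparable exposure.
sparse = shrink_toward_prior(stream_mean=0.40, n_meetings=2, prior_mean=0.10)
dense = shrink_toward_prior(stream_mean=0.40, n_meetings=20, prior_mean=0.10)
```

The sparse stream ends up much closer to the neutral value than the dense one. Shrinkage only tempers the estimate; the evidence grade and abstention decision still need to be reported separately.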
Use empirical-Bayes / hierarchical shrinkage or simpler conservative priors in MVP.

5. Recommended scoring model
Do not use one flat “risk score.” Use layered outputs.

5.1 Meeting-level detector output
For each meeting and each detector:
- detector_value_raw
- detector_value_normalized
- detector_confidence
- input_quality
- opportunity_count
- meeting_weight
- comparable_for_longitudinal (true/false)

5.2 Stream-level evidence object
For each stream (person, dyad, team) and detector family:
- window_30d_value
- window_90d_value
- window_180d_value
- ewma_value
- drift_delta
- persistence_rate
- exposure_score
- variance_or_instability
- evidence_grade
- abstain_flag
- abstain_reason_codes

5.3 Composite logic
If you insist on a composite, make it second-order and decomposable.
Suggested formula skeleton:
stream_signal = robust_mean(
    normalized_meeting_score
    * opportunity_weight
    * input_quality_weight
    * meeting_type_confidence_weight
    * recency_weight
)
Then compute separately:
- persistence component
- drift component
- directional concentration component
- confidence component
Then only optionally construct:
review_support_priority =
    severity_component
    * persistence_component
    * directionality_component
    * confidence_component
Hard rule:
Never let confidence hide inside the score. Confidence / evidence grade must be separately visible.

6. Evidence-grade design
Recommended evidence grades:
A = strong repeated comparable exposure, stable pattern, good input quality
B = moderate exposure and consistency
C = limited exposure or higher variance
D = sparse or confounded
X = abstain / insufficient basis
Evidence grade should depend on:
- comparable exposure volume
- number of distinct comparable meetings
- detector opportunity count
- meeting-type confidence
- transcript confidence
- diarization confidence
- stability across windows
- whether the pattern is concentrated in one outlier session
- whether the pattern survives confound suppression
Example downgrade logic:
- fewer than 5 comparable meetings in 90d -> cannot exceed grade C
- low diarization confidence -> interruption family max C
- low meeting-type confidence -> no review-worthy event
- one meeting contributes >20% of weighted evidence -> degrade one level
- pattern disappears after role or meeting-type normalization -> abstain

7. Repeated exposure vs random noise
Define repeated exposure explicitly. Do not leave this as vibes.
A pattern should qualify as repeated only if all conditions below hold:
1. Same stream: same person or same dyad or same team/subgroup
2. Same detector family: e.g. interruption burden, ignored-turn burden, chilling burden
3. Comparable context: same or calibrated meeting type, similar role entitlement, acceptable input quality
4. Enough opportunity: the detector had enough chances to be observed
5. Persistence: seen across more than one meeting or through sustained drift, not one spike
6. Non-fragility: result is not erased by removing one single meeting
Practical implementation:
Require at least one of:
- recurrence across >= 3 comparable meetings, or
- sustained EWMA/CUSUM shift across time, or
- repeated dyadic directionality beyond threshold
And also require:
- leave-one-out stability check passes
If removing any one meeting destroys the signal entirely, downgrade or abstain.

8. Time-window doctrine
Do not treat 30 / 90 / 180 as arbitrary dashboard cosmetics. They should mean different things.

30-day window
Use for:
- emerging drift
- recent change
- user awareness
- recent self-reflection
Not enough by default for strong institutional interpretation unless exposure is unusually high.

90-day window
Use for:
- main pattern inference
- employee private pattern summary
- manager mirror trend
- default review-support bundle
This should be the primary inferential window.

180-day window
Use for:
- persistence vs recovery
- whether correction actually lasted
- whether the pattern predates a recent manager change
- historical context for investigators under approved procedure

Recommendation:
Make 90 days the main default. Use 30 days for responsiveness and 180 days for historical anchoring. Do not require all three windows to agree perfectly. A deteriorating recent pattern may only show up in 30d + EWMA before it dominates 180d.

9. Baseline design
The baseline stack should be:
1. Own historical baseline within meeting type
2. Own historical baseline within role entitlement
3. Dyad baseline
4. Within-meeting peer comparison
5. Team/environment baseline
6. Optional locale/language-conditioned baseline later
This matters because “same raw value” can mean different things:
- facilitator interruptions may be normal
- trainer airtime dominance may be normal
- 1:1 manager talk share may be structurally asymmetric
- brainstorm overlap is noisier than standup overlap
Without baseline stack, longitudinal aggregation just compounds category errors over time.

10. Drift vs burden: keep them separate
Kashi should not collapse these into one dimension.
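To make the distinction concrete, here is a minimal sketch that computes the two quantities separately from the same per-meeting data. It assumes burden is an opportunity-normalized event rate over the last 90 days, and drift is the last-30-day rate minus the rate over the preceding 60 days; the MeetingObs type, both function names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeetingObs:
    days_ago: int        # age of the comparable meeting in days
    events: int          # detector events received by the person
    opportunities: int   # detector-relevant opportunities in that meeting


def rate(meetings):
    """Events per opportunity, or None when there was no opportunity at all."""
    opportunities = sum(m.opportunities for m in meetings)
    if opportunities == 0:
        return None
    return sum(m.events for m in meetings) / opportunities


def burden_90d(meetings):
    """Level: how much asymmetry was received over the 90-day window."""
    return rate([m for m in meetings if 0 <= m.days_ago < 90])


def drift_30_vs_prior(meetings):
    """Change: last-30-day rate minus the person's own rate over days 30..89."""
    recent = rate([m for m in meetings if 0 <= m.days_ago < 30])
    prior = rate([m for m in meetings if 30 <= m.days_ago < 90])
    if recent is None or prior is None:
        return None  # not enough basis to talk about change
    return recent - prior


# Chronic case: consistently high, not changing.
chronic = [MeetingObs(d, 3, 10) for d in (80, 60, 20, 5)]
# Worsening case: lower overall level, but clearly rising.
worsening = [MeetingObs(80, 1, 10), MeetingObs(60, 1, 10),
             MeetingObs(20, 3, 10), MeetingObs(5, 3, 10)]
```

The chronic stream has the higher burden and zero drift, while the worsening stream has the lower burden and a clearly positive drift, so a single blended number would misrank at least one of them.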
Burden signal
“How much asymmetry is this person receiving over the window?”

Drift signal
“Is the situation worsening relative to their own prior baseline?”

Why separate:
- chronic low-grade burden may be real even without deterioration
- recent deterioration may be critical even if absolute level is still moderate
- intervention logic differs
Recommended fields:
- person_burden_90d
- person_drift_30v90
- dyad_directionality_90d
- team_climate_90d
- confidence_grade
- abstain_flag

11. Detector-specific aggregation notes

11.1 Intrusive interruption
Strong fit for dyad stream.
Aggregate:
- rate per opportunity
- directional concentration
- continuity across meetings
Use:
- robust count normalization
- leave-one-out check
- meeting-type suppression where role-entitled interruption is normal

11.2 Chilling delta
Very fragile in sparse data.
Needs:
- good pre/post participation opportunity
- enough meeting participation
- careful baseline per person
Cold-start skip is correct; extend this logic aggressively.

11.3 Floor-time Gini
Good team-level climate signal. Weak as person-level accusation.
Aggregate as:
- team climate index
- subgroup compression trend
- person share deviation from own-type baseline

11.4 Unanswered-question rate
Needs opportunity denominator. Do not aggregate raw counts.
Use:
- questions asked
- response windows
- input-quality / semantic-confidence modifier if semantics are involved

11.5 Topic-credit ignored-turns
High-value but semantically fragile. Do not let low-confidence topic similarity dominate longitudinal conclusions. Needs its own confidence budget and should not be allowed to “outvote” cleaner structural detectors in sparse settings.

11.6 Agreement asymmetry
Potentially useful, but also semantically and contextually fragile. Keep separate confidence and require stronger exposure before escalating.

12. Confidence object design
Kashi should stop treating “confidence” as one scalar. Use a confidence object.
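As a rough illustration of why the components should stay separate, the sketch below keeps four component confidences apart and applies the example downgrade rules from section 6 as caps on an evidence grade. The class and function names, the 0.5 cut-off for "low" confidence, the base_grade input, and the choice to stop one-level degradation at D (so that abstention is only triggered by an explicit rule) are all assumptions made for the example.

```python
from dataclasses import dataclass

GRADES = ["A", "B", "C", "D", "X"]  # strongest to weakest; X = abstain


@dataclass(frozen=True)
class Confidence:
    transcript: float
    diarization: float
    meeting_type: float
    detector: float


def cap(grade, ceiling):
    """Return whichever of the two grades is weaker."""
    return max(grade, ceiling, key=GRADES.index)


def evidence_grade(base_grade, conf, detector_family, comparable_meetings_90d,
                   max_meeting_share, survives_normalization=True, low=0.5):
    """Apply the section 6 example downgrade rules; returns (grade, review_worthy)."""
    if not survives_normalization:
        return "X", False                      # pattern vanished after normalization: abstain
    grade = base_grade
    if comparable_meetings_90d < 5:
        grade = cap(grade, "C")                # sparse exposure cannot exceed C
    if detector_family == "interruption" and conf.diarization < low:
        grade = cap(grade, "C")                # interruption depends on who-spoke-when
    if max_meeting_share > 0.20 and grade != "X":
        worse = min(GRADES.index(grade) + 1, GRADES.index("D"))
        grade = GRADES[worse]                  # one meeting dominates: degrade one level
    review_worthy = grade != "X" and conf.meeting_type >= low
    return grade, review_worthy


# Same components, different detector families.
conf = Confidence(transcript=0.9, diarization=0.3, meeting_type=0.9, detector=0.9)
interruption = evidence_grade("A", conf, "interruption", 8, 0.10)
floor_time = evidence_grade("A", conf, "floor_time", 8, 0.10)
```

With these inputs the interruption stream is capped at C because diarization is weak, while the floor-time stream keeps its base grade. A single averaged confidence scalar could not express that difference.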
Suggested confidence object:
{
  transcript_confidence,
  diarization_confidence,
  meeting_type_confidence,
  detector_confidence,
  exposure_confidence,
  stability_confidence,
  anti_confound_confidence,
  overall_evidence_grade
}
Why:
A dyad interruption stream may have:
- high detector logic confidence
- low diarization confidence
That should downgrade the stream without pretending the whole system is equally certain or uncertain.

13. Abstention policy
A serious system needs the power to say “not enough longitudinal basis.”
Abstain when:
- comparable exposure too sparse
- meeting types too mixed
- detector opportunity too low
- one meeting dominates the trend
- input quality too weak
- result disappears after normalization or leave-one-out
- meeting type unsupported or low-confidence
- privacy thresholds prevent safe aggregation
Abstention output should still show:
- what was observed
- why interpretation is limited
- what additional exposure would increase confidence
Example UI copy logic:
Observed: elevated interruption burden in 2 recent comparable meetings.
Not shown as a persistent pattern because comparable exposure is still limited and one meeting currently contributes too much of the evidence.
That is way better than a fake number.

14. Privacy / retaliation implications of trend windows
Trend windows are not neutral. They can leak concern states and identity in small teams.
Important rule:
Trend-window inspection by the user must not create employer-visible telemetry. Trend-window aggregates shown upward must obey anti-inference rules.
Operational implications:
- no employer-visible “user checked 30d vs 90d trend” events
- small-team suppression for subgroup and dyad views
- batching or delay for employer-side summaries
- no named subordinate trend browsing by managers
- no hidden mirror export into appraisal or discipline workflows
Longitudinal aggregation increases inferability because patterns become more identifiable over time.
The privacy model must therefore be stricter, not looser, at the cross-meeting layer.

15. Recommended data model additions

meeting_table
- meeting_id
- meeting_type
- meeting_type_confidence
- internal_external_flag
- language_regime
- transcript_confidence
- diarization_confidence
- comparable_group_key

detector_event_table
- detector_family
- actor_id
- target_id_nullable
- opportunity_count
- raw_value
- normalized_value
- detector_confidence
- excluded_from_longitudinal_reason_nullable

stream_aggregate_table
- stream_type (person/dyad/team/subgroup)
- stream_key
- detector_family
- comparable_group_key
- window_30d_value
- window_90d_value
- window_180d_value
- ewma_value
- cusum_value_nullable
- persistence_rate
- drift_delta
- exposure_score
- instability_score
- leave_one_out_fragility
- evidence_grade
- abstain_flag
- abstain_reasons_json
- generated_at

review_support_object_table
- object_id
- stream_key
- detector_family
- review_priority
- evidence_grade
- supporting_meeting_ids
- top_reason_codes
- bounded_context_refs

16. Recommended implementation sequence

P0
- meeting comparability key
- exposure fields
- abstention reasons
- per-meeting influence cap
- leave-one-out fragility check
- 30 / 90 / 180 window summaries

P1
- EWMA drift monitor
- confidence object
- evidence grades
- detector-specific opportunity denominators
- team vs dyad vs person stream separation

P2
- CUSUM small-shift detection where useful
- empirical-Bayes shrinkage / hierarchical modeling
- subgroup streams under privacy controls
- deeper multilingual / locale-conditioned priors

17. Recommended algorithm skeleton
For each meeting:
1. compute detector outputs
2. normalize by detector-specific opportunity
3. attach input-quality and meeting-type-confidence weights
4. decide whether each output is eligible for longitudinal inference
For each comparable stream:
5. gather eligible meeting outputs by detector family and stream key
6. apply per-meeting influence cap
7. compute robust window summaries (30/90/180)
8. compute EWMA and optional CUSUM
9. compute exposure score
10. run leave-one-out fragility test
11. assign evidence grade
12. abstain if rules triggered
13. only then build review-support object if priority and confidence both pass threshold

18. Test plan / acceptance criteria

A. One weird meeting does not poison the trend
Given a 90-day stream with one extreme meeting and otherwise normal history, the stream should:
- preserve the severe meeting as visible evidence
- not automatically produce a persistent-pattern object
- show degraded stability if that one meeting dominates

B. Comparable exposure required
Given many meetings with low interaction opportunity, the system should not treat raw meeting count as strong evidence.

C. Cross-type pooling blocked
Given standups + 1:1s + training sessions mixed together, the system should not produce one unified inferential score without meeting-type normalization.

D. Sparse-data conservatism
Given fewer than the required comparable opportunities, the system should abstain or downgrade.

E. Drift detection works
Given small but persistent deterioration across recent comparable meetings, EWMA/CUSUM should detect emergence earlier than simple average-threshold logic.

F. Leave-one-out fragility exposed
If removing one meeting destroys the signal, fragility should be high and evidence grade low.

G. Privacy safe
User trend exploration creates no employer-visible event. Small-team upward summaries are suppressed or anti-inference filtered.

19. Critical conclusion
The aggregation layer should be treated as a measurement engine, not a reporting layer.
If Kashi gets this right:
- the product becomes materially more defensible
- the system can explain why a pattern is considered persistent
- one ugly meeting does not fabricate a fake “trend”
- recent deterioration can be detected before the whole 90-day window turns red
- abstention becomes a strength instead of a bug
If Kashi gets this wrong:
- it will either overreact to noise
- or smooth real deterioration into irrelevance
- or both, depending on which stakeholder is looking.

Recommended doctrine in one sentence:
Kashi should aggregate review-support evidence across comparable meetings using robust, recency-aware, exposure-weighted, confidence-aware streams at person, dyad, and team levels, with strict abstention when one meeting or weak input would otherwise fake a pattern.

Selected references used in this memo

Internal Kashi docs
- Kashi — Progress & Project Overview (2026-04-21)
- Kashi Measurement-Science Research Memo (2026-04-21)
- Kashi Meeting-Type Normalization Research Memo (2026-04-21)
- meeting_governance_ai_concept_note.docx
- Kashi Retaliation-Risk Research Memo (2026-04-21)

External
- NIST AI 800-3: Expanding the AI Evaluation Toolbox with Statistical Models (official NIST summary page, 2026)
- NIST AI 800-4: Challenges to the Monitoring of Deployed AI Systems (official NIST summary page, 2026)
- NIST/SEMATECH e-Handbook of Statistical Methods: EWMA Control Charts
- NIST/SEMATECH e-Handbook of Statistical Methods: CUSUM Control Charts
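
Illustrative sketch for steps 6 and 10 of the algorithm skeleton in section 17
The per-meeting influence cap and the leave-one-out fragility test can be sketched as below. This is one possible reading, not a settled design: it assumes each meeting has already been reduced to a non-negative weighted evidence value, interprets the 20% cap as a share of the total after capping, and treats "signal" as the mean evidence reaching a threshold. The function names, the threshold, and the sample values are hypothetical.

```python
from statistics import fmean


def cap_influence(evidence, max_share=0.20):
    """Cap per-meeting evidence so no meeting exceeds max_share of the capped total.

    Returns a new list, or None when the cap cannot be met (for example an empty
    stream, or too few meetings carrying evidence).
    """
    order = sorted(range(len(evidence)), key=lambda i: evidence[i], reverse=True)
    capped = list(evidence)
    rest = sum(evidence)              # evidence held by meetings not capped so far
    for k, i in enumerate(order):     # k = number of meetings capped so far
        if k * max_share >= 1:
            return None
        limit = max_share * rest / (1 - k * max_share)
        if evidence[i] <= limit:      # largest remaining meeting fits under the cap
            for j in order[:k]:
                capped[j] = limit
            return capped
        rest -= evidence[i]
    return None


def is_fragile(evidence, threshold):
    """True if a signal exists but removing some single meeting destroys it."""
    if not evidence or fmean(evidence) < threshold:
        return False                  # no signal, so nothing to destroy
    for i in range(len(evidence)):
        remaining = evidence[:i] + evidence[i + 1:]
        if not remaining or fmean(remaining) < threshold:
            return True
    return False


# One extreme meeting among nine ordinary ones (made-up values).
spiky = [10.0] + [1.0] * 9
capped = cap_influence(spiky)
persistent = [2.0] * 6
```

On the spiky stream the uncapped mean clears a threshold of 1.5 only because of the single extreme meeting, so it is flagged as fragile; after capping, that meeting holds exactly 20% of the total and the apparent signal disappears. With fewer than five evidence-bearing meetings a 20% cap cannot be satisfied at all, which is consistent with treating such streams as cold-start.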