Kashi — Longitudinal Aggregation Perspective
Technical research memo for developers
Date: 2026-04-21

Purpose
Turn the “longitudinal aggregation” question into a concrete engineering and measurement design for Kashi.

Scope
This memo focuses on the aggregation layer that sits above meeting-level detector output. It is written for implementation planning, not pitch copy.

Bottom line
Kashi only becomes meaningfully “Kashi” at the longitudinal layer, but this is also where the system can most easily become statistically sloppy or politically dangerous. The aggregation layer should not be “sum scores across meetings and draw a line chart.” It should be a confidence-aware evidence accumulation system that:
1) aggregates separately by unit (person, dyad, team, subgroup),
2) only compares like with like,
3) weights by exposure and input quality,
4) caps the influence of any single meeting,
5) uses recency-aware drift detection rather than naïve averaging,
6) abstains when comparable exposure is too weak.

Internal anchor
Kashi’s current materials already establish the core shape: longitudinal aggregation over 30 / 90 / 180-day windows, at least per person / per dyad / per team, calibrated to each speaker’s own baseline rather than team average. They also already say a single meeting is noise, a 90-day pattern is signal, and at least one detector already has a cold-start skip rule for fewer than 5 meetings. The measurement-science and meeting-type memos then tighten this further: person-level interpretation requires repeated comparable exposure, meeting count alone is insufficient, meeting-type normalization is part of validity, and outputs should carry evidence grade and abstention when evidence is weak.

1. What the longitudinal layer is actually for

The aggregation layer has five jobs:

A. Convert meeting-level observations into pattern-level evidence.
Meeting-level detectors tell you that something happened in one session. Aggregation tells you whether the same thing keeps happening to the same person, from the same counterpart, under comparable conditions, often enough that it becomes review-worthy.

B. Separate repeated exposure from random noise.
One rough meeting should not produce a person-level risk narrative. The layer should ask:
- Did this pattern recur?
- In comparable meeting types?
- With enough interaction opportunity?
- With enough input quality?
- In the same directional relationship?

C. Detect drift, not only average state.
A person may look “normal on average” while showing a sharp 45-day deterioration. Longitudinal logic must detect both:
- chronic asymmetry (persistently bad)
- trajectory change (getting worse)

D. Prevent overclaiming.
The layer must be where abstention happens. If the system lacks enough comparable exposure, it should compute observations but stop before constructing review-worthy pattern objects.

E. Preserve contestability.
Aggregation must remain decomposable. Every trend must be traceable back to:
- which meetings entered the aggregate
- which meetings were excluded and why
- how each meeting was weighted
- what the confidence or evidence grade was

2. Recommended unit-of-accumulation model

Do not choose one unit. Use a stack.

2.1 Person-level stream
Question answered:
“How is this person’s treatment changing over time?”

Use for:
- employee private view
- self-baseline drift
- speaking-share change
- chilling or unanswered-question burden over time

Do not use alone for:
- accusing a specific counterpart
- inferring cause without dyadic support

2.2 Dyad-level stream
Question answered:
“Does A treat B differently over time?”

This is the most important unit for directional asymmetry.
Use for:
- interruption directionality
- unanswered-question burden from one counterpart
- repeated takeover / credit capture involving the same pair
- manager -> employee asymmetry patterns

The dyad stream is often more probative than the person stream because it tests direction, not just burden.

2.3 Team-level stream
Question answered:
“Is this meeting environment structurally distorting participation?”

Use for:
- floor-time inequality
- subgroup participation compression
- team-level dominance structure
- whether the problem is broader than one dyad

Do not let team-level aggregates wash out targeted harm. A team can look broadly fine while one person is repeatedly suppressed.

2.4 Subgroup-level stream
Question answered:
“Is a class of participants getting systematically worse interaction access?”

Examples:
- juniors vs seniors
- L2 speakers vs native/near-native majority-language speakers
- functional subgroup
- externally tagged demographic or protected-category proxies ONLY if legally and ethically validated later; probably not MVP

This should be aggregate-only and heavily privacy-constrained.

2.5 Event-family stream
Question answered:
“Is this detector family recurrent enough to matter?”

Examples:
- interruption family
- chilling family
- ignored-turn family

Useful because some detectors are sparse. Aggregating at event-family level can increase stability without flattening everything into one fake composite too early.

Recommended doctrine:
Kashi should aggregate at person, dyad, team, and event-family levels by default.
Subgroup aggregation should exist only where privacy thresholds and governance conditions are met.

3. The core mistake to avoid

Bad design:
Take meeting-level detector scores, average them over 30 / 90 / 180 days, and show the result.

Why this is bad:
- It treats all meetings as comparable.
- It gives a one-off extreme meeting too much power.
- It ignores interaction opportunity.
- It ignores sparse data.
- It ignores input quality.
- It hides whether the pattern is chronic or just recent.
- It looks mathematically clean while being epistemically fake.

Kashi needs accumulation, not mere averaging.

4. Recommended aggregation doctrine

4.1 Comparability gate before aggregation
A meeting may enter a person/dyad trend only if it passes comparability checks.

Minimum comparability fields:
- meeting_type
- meeting_type_confidence
- role schema / role entitlement
- internal vs external
- language regime / multilingual flag
- transcript quality
- diarization quality
- interaction opportunity level

Hard rule:
Cross-type pooling should be prohibited for risk interpretation.
Weekly sync, standup, 1:1, client call, incident bridge, and training session should not enter the same inferential stream as though they were interchangeable.

If meeting_type is unknown or low-confidence:
- allow observational metrics
- block review-worthy pattern construction by default
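
A minimal sketch of this gate, assuming hypothetical field names, reason codes, and thresholds (0.7 and 0.6 are placeholders, not Kashi values). It only decides whether a meeting may feed review-worthy pattern construction for one typed stream; observational metrics stay available either way.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
MIN_TYPE_CONFIDENCE = 0.7
MIN_INPUT_QUALITY = 0.6


@dataclass(frozen=True)
class MeetingMeta:
    meeting_type: Optional[str]      # None means unknown
    meeting_type_confidence: float   # 0..1
    transcript_quality: float        # 0..1
    diarization_quality: float       # 0..1


def longitudinal_eligibility(m: MeetingMeta, stream_meeting_type: str) -> tuple[bool, list[str]]:
    """Return (eligible, reason_codes) for one meeting entering one typed stream."""
    reasons: list[str] = []
    if m.meeting_type is None:
        reasons.append("meeting_type_unknown")
    elif m.meeting_type != stream_meeting_type:
        # Hard rule: no cross-type pooling for risk interpretation.
        reasons.append("cross_type_pooling_blocked")
    if m.meeting_type_confidence < MIN_TYPE_CONFIDENCE:
        reasons.append("meeting_type_low_confidence")
    if min(m.transcript_quality, m.diarization_quality) < MIN_INPUT_QUALITY:
        reasons.append("input_quality_low")
    return (not reasons, reasons)
```

The returned reason codes can be stored per meeting so that exclusions stay traceable.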

4.2 Exposure gating
Meeting count is not enough.
The real denominator is comparable exposure.

Exposure should include:
- number of comparable meetings
- total comparable minutes
- number of turns involving the person
- number of turns involving the dyad
- number of detector-relevant opportunities

Examples:
- interruption continuity needs enough overlapping turn opportunities
- unanswered-question burden needs enough actual questions
- topic-credit patterns need enough proposal opportunities

Recommended exposure fields:
exposure_meetings_30d
exposure_meetings_90d
exposure_minutes_90d
exposure_turns_person_90d
exposure_turns_dyad_90d
exposure_detector_opportunities_90d
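
As an illustration, the fields above could be derived from already-gated meetings roughly as follows. The `ComparableMeeting` record and its attribute names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ComparableMeeting:
    held_on: date
    minutes: float
    person_turns: int
    dyad_turns: int
    detector_opportunities: int


def exposure_fields(meetings: list[ComparableMeeting], as_of: date) -> dict[str, float]:
    """Summarise comparable exposure; `meetings` must already have passed the comparability gate."""

    def within(days: int) -> list[ComparableMeeting]:
        start = as_of - timedelta(days=days)
        return [m for m in meetings if start < m.held_on <= as_of]

    last_30, last_90 = within(30), within(90)
    return {
        "exposure_meetings_30d": len(last_30),
        "exposure_meetings_90d": len(last_90),
        "exposure_minutes_90d": sum(m.minutes for m in last_90),
        "exposure_turns_person_90d": sum(m.person_turns for m in last_90),
        "exposure_turns_dyad_90d": sum(m.dyad_turns for m in last_90),
        "exposure_detector_opportunities_90d": sum(m.detector_opportunities for m in last_90),
    }
```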

4.3 Recency-aware accumulation
Use two parallel mechanisms, not one.

Mechanism A: windowed summaries
Compute bounded summaries for:
- last 30 days
- last 90 days
- last 180 days

Purpose:
- 30d = recent operational visibility
- 90d = main review-support window
- 180d = historical persistence / recovery check

Mechanism B: online drift statistics
Use a recency-weighted stream to detect gradual change.
Recommended methods:
- EWMA for smooth drift detection
- optional CUSUM for small persistent shifts

Why:
EWMA gives higher weight to recent observations and is well suited to small gradual drift.
CUSUM is good for detecting smaller shifts that do not exceed one-meeting thresholds but accumulate over time.

Practical recommendation:
Use windowed summaries for UI and reporting.
Use EWMA/CUSUM-style internal monitors for “pattern emerging” logic.
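
A minimal sketch of the two monitors in their textbook form. The smoothing factor `alpha`, the `target` baseline, and the `slack` allowance are tuning assumptions that would need calibration per detector and stream.

```python
def ewma(values: list[float], alpha: float = 0.3) -> list[float]:
    """Exponentially weighted moving average; the first value seeds the series."""
    if not values:
        return []
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def cusum_upper(values: list[float], target: float, slack: float) -> list[float]:
    """One-sided upper CUSUM: accumulates excess over (target + slack), floored at zero."""
    s = 0.0
    out = []
    for x in values:
        s = max(0.0, s + (x - target - slack))
        out.append(s)
    return out
```

A "pattern emerging" monitor would compare the latest EWMA or CUSUM value against a decision threshold. Values should be fed oldest first and should come only from comparable meetings.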

4.4 Robustness against one weird meeting
This is the make-or-break requirement.

Do not let one meeting dominate the trend.
Recommended controls:

A. Per-meeting influence cap
Cap the maximum contribution any single meeting can make to a 90-day aggregate.
Example:
max 20% of total weighted evidence in a 90-day stream from one meeting
or
winsorize detector-specific z-scores at a fixed bound
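
One way to implement the share cap is to lower the largest meeting weights to a common level until no meeting exceeds the cap. This is a sketch under assumptions: weights are non-negative, and the 20% figure is the example value from above. With fewer than 1 / max_share meetings the cap cannot be met, so the function raises and leaves the abstain-or-downgrade decision to the caller.

```python
def cap_meeting_influence(weights: list[float], max_share: float = 0.2) -> list[float]:
    """Lower the largest weights to a common level so none exceeds max_share of the total."""
    n = len(weights)
    if n == 0:
        return []
    if n * max_share < 1.0:
        raise ValueError("too few meetings for this cap; abstain or downgrade instead")
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    for k in range(n):  # k = number of largest meetings being capped
        uncapped_sum = sum(weights[i] for i in order[k:])
        denom = 1.0 - max_share * k
        if denom <= 0.0:
            break
        # Level c that solves c / (k * c + uncapped_sum) == max_share.
        level = max_share * uncapped_sum / denom
        if weights[order[k]] <= level:
            capped = list(weights)
            for i in order[:k]:
                capped[i] = level
            return capped
    # Numerical edge case: fall back to equal weights at the smallest value.
    return [min(weights)] * n
```

For example, weights of 10 for one meeting and 1 for each of nine others come back with the first weight lowered to 2.25, which is exactly 20% of the new total.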

B. Meeting-size / opportunity normalization
A 3-minute exchange and a 90-minute workshop should not have equal influence.
Weight by validated opportunity, not only raw event count.

C. Outlier flagging, not silent smoothing
If one meeting is statistically extreme:
- flag it as outlier/high-severity
- keep it visible in evidence
- do not let it fully rewrite the trend

D. Separate “severe one-off” from “persistent pattern”
A severe one-off can still matter operationally, but it should not be mislabeled as longitudinal persistence.
Keep distinct fields:
- severe_single_meeting_flag
- persistence_score
- drift_score

E. Shrinkage toward conservative prior in sparse data
Early streams should be pulled toward “uncertain / weak evidence,” not toward dramatic conclusions.
Use empirical-Bayes / hierarchical shrinkage or simpler conservative priors in MVP.
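
For the "simpler conservative prior" option, a pseudo-count shrinkage is one possibility. This is not a full empirical-Bayes model; `prior_mean` and `prior_strength` are assumptions that would need to be set per detector.

```python
def shrink_toward_prior(observed_mean: float, n_opportunities: int,
                        prior_mean: float = 0.0, prior_strength: float = 20.0) -> float:
    """Blend the observed mean with a conservative prior; sparse streams stay near the prior."""
    weight = n_opportunities / (n_opportunities + prior_strength)
    return weight * observed_mean + (1.0 - weight) * prior_mean
```

With `prior_strength = 20`, a stream with 20 opportunities is pulled halfway to the prior, while a stream with 180 opportunities keeps 90% of its observed value.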

5. Recommended scoring model

Do not use one flat “risk score.”
Use layered outputs.

5.1 Meeting-level detector output
For each meeting and each detector:
- detector_value_raw
- detector_value_normalized
- detector_confidence
- input_quality
- opportunity_count
- meeting_weight
- comparable_for_longitudinal (true/false)

5.2 Stream-level evidence object
For each stream (person, dyad, team) and detector family:
- window_30d_value
- window_90d_value
- window_180d_value
- ewma_value
- drift_delta
- persistence_rate
- exposure_score
- variance_or_instability
- evidence_grade
- abstain_flag
- abstain_reason_codes

5.3 Composite logic
If you insist on a composite, make it second-order and decomposable.

Suggested formula skeleton:

stream_signal =
  robust_weighted_mean(
    values  = normalized_meeting_score,
    weights = opportunity_weight
              * input_quality_weight
              * meeting_type_confidence_weight
              * recency_weight
  )

Then compute separately:
- persistence component
- drift component
- directional concentration component
- confidence component

Then only optionally construct:
review_support_priority =
  severity_component
  * persistence_component
  * directionality_component
  * confidence_component

Hard rule:
Never let confidence hide inside the score.
Confidence / evidence grade must be separately visible.
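
One possible reading of the stream_signal skeleton treats the four weights as a single per-meeting weight in a weighted mean, with scores winsorized first for robustness. The clip bound of 3.0 and the `MeetingScore` record are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MeetingScore:
    normalized_score: float
    opportunity_weight: float
    input_quality_weight: float
    meeting_type_confidence_weight: float
    recency_weight: float


def stream_signal(scores: list[MeetingScore], clip: float = 3.0) -> Optional[float]:
    """Weighted mean of winsorized scores; returns None when there is no usable weight."""
    numerator = 0.0
    total_weight = 0.0
    for s in scores:
        weight = (s.opportunity_weight * s.input_quality_weight
                  * s.meeting_type_confidence_weight * s.recency_weight)
        clipped = max(-clip, min(clip, s.normalized_score))
        numerator += weight * clipped
        total_weight += weight
    return numerator / total_weight if total_weight > 0 else None
```

Dividing by the total weight keeps a low-quality stream from looking like a low-signal stream. The evidence grade is deliberately not folded into the returned number.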

6. Evidence-grade design

Recommended evidence grades:
A = strong repeated comparable exposure, stable pattern, good input quality
B = moderate exposure and consistency
C = limited exposure or higher variance
D = sparse or confounded
X = abstain / insufficient basis

Evidence grade should depend on:
- comparable exposure volume
- number of distinct comparable meetings
- detector opportunity count
- meeting-type confidence
- transcript confidence
- diarization confidence
- stability across windows
- whether the pattern is concentrated in one outlier session
- whether the pattern survives confound suppression

Example downgrade logic:
- fewer than 5 comparable meetings in 90d -> cannot exceed grade C
- low diarization confidence -> interruption family max C
- low meeting-type confidence -> no review-worthy event
- one meeting contributes >20% of weighted evidence -> downgrade one level
- pattern disappears after role or meeting-type normalization -> abstain
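
The downgrade examples above could be encoded as follows. The order in which the rules apply, the family name "interruption", and the choice to stop the one-level downgrade at D rather than X are assumptions of this sketch. The low meeting-type-confidence rule is omitted because it blocks review-worthy events rather than changing the grade.

```python
GRADES = ["A", "B", "C", "D", "X"]  # best to worst


def worse_of(a: str, b: str) -> str:
    """Return whichever grade is lower in quality."""
    return a if GRADES.index(a) >= GRADES.index(b) else b


def apply_downgrades(base_grade: str, *, comparable_meetings_90d: int,
                     top_meeting_share: float, low_diarization: bool,
                     detector_family: str, survives_normalization: bool) -> str:
    if base_grade == "X" or not survives_normalization:
        return "X"  # abstain
    grade = base_grade
    if comparable_meetings_90d < 5:
        grade = worse_of(grade, "C")
    if low_diarization and detector_family == "interruption":
        grade = worse_of(grade, "C")
    if top_meeting_share > 0.20:
        # Downgrade one level, but do not turn a grade into an abstention here.
        grade = GRADES[min(GRADES.index(grade) + 1, GRADES.index("D"))]
    return grade
```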

7. Repeated exposure vs random noise

Define repeated exposure explicitly.
Do not leave this as vibes.

A pattern should qualify as repeated only if all conditions below hold:

1. Same stream:
same person or same dyad or same team/subgroup

2. Same detector family:
e.g. interruption burden, ignored-turn burden, chilling burden

3. Comparable context:
same or calibrated meeting type, similar role entitlement, acceptable input quality

4. Enough opportunity:
the detector had enough chances to be observed

5. Persistence:
seen across more than one meeting or through sustained drift, not one spike

6. Non-fragility:
result is not erased by removing one single meeting

Practical implementation:
Require at least one of:
- recurrence across >= 3 comparable meetings, or
- sustained EWMA/CUSUM shift across time, or
- repeated dyadic directionality beyond threshold

And also require:
- leave-one-out stability check passes
If removing any one meeting destroys the signal entirely, downgrade or abstain.
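
A minimal leave-one-out check, assuming the signal is "mean normalized score at or above a threshold". The real signal statistic may differ; the mean is used here only to keep the sketch short.

```python
from statistics import fmean


def signal_is_fragile(scores: list[float], threshold: float) -> bool:
    """True if the signal exists on all meetings but vanishes when some single meeting is removed."""
    if not scores or fmean(scores) < threshold:
        return False  # no signal, so nothing to be fragile
    if len(scores) == 1:
        return True   # a single-meeting signal is fragile by definition
    return any(
        fmean(scores[:i] + scores[i + 1:]) < threshold
        for i in range(len(scores))
    )
```

A stream of three quiet meetings and one extreme meeting is fragile under this check, while a stream that is moderately elevated in every meeting is not.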

8. Time-window doctrine

Do not treat 30 / 90 / 180 as arbitrary dashboard cosmetics.
They should mean different things.

30-day window
Use for:
- emerging drift
- recent change
- user awareness
- recent self-reflection
Not enough by default for strong institutional interpretation unless exposure is unusually high.

90-day window
Use for:
- main pattern inference
- employee private pattern summary
- manager mirror trend
- default review-support bundle
This should be the primary inferential window.

180-day window
Use for:
- persistence vs recovery
- whether correction actually lasted
- whether the pattern predates a recent manager change
- historical context for investigators under approved procedure

Recommendation:
Make 90 days the main default.
Use 30 days for responsiveness and 180 days for historical anchoring.

Do not require all three windows to agree perfectly.
A deteriorating recent pattern may only show up in 30d + EWMA before it dominates 180d.

9. Baseline design

The baseline stack should be:

1. Own historical baseline within meeting type
2. Own historical baseline within role entitlement
3. Dyad baseline
4. Within-meeting peer comparison
5. Team/environment baseline
6. Optional locale/language-conditioned baseline later

This matters because “same raw value” can mean different things:
- facilitator interruptions may be normal
- trainer airtime dominance may be normal
- 1:1 manager talk share may be structurally asymmetric
- brainstorm overlap is noisier than standup overlap

Without baseline stack, longitudinal aggregation just compounds category errors over time.

10. Drift vs burden: keep them separate

Kashi should not collapse these into one dimension.

Burden signal
“How much asymmetry is this person receiving over the window?”

Drift signal
“Is the situation worsening relative to their own prior baseline?”

Why separate:
- chronic low-grade burden may be real even without deterioration
- recent deterioration may be critical even if absolute level is still moderate
- intervention logic differs

Recommended fields:
person_burden_90d
person_drift_30v90
dyad_directionality_90d
team_climate_90d
evidence_grade
abstain_flag

11. Detector-specific aggregation notes

11.1 Intrusive interruption
Strong fit for dyad stream.
Aggregate:
- rate per opportunity
- directional concentration
- continuity across meetings
Use:
- robust count normalization
- leave-one-out check
- meeting-type suppression where role-entitled interruption is normal
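
As a sketch, rate per opportunity and directional concentration for a dyad could be computed as below. The definition of concentration used here (the A-to-B share of the two directional rates) is an assumption for illustration, not a definition taken from Kashi's materials.

```python
from typing import Optional


def pooled_rate(events: list[int], opportunities: list[int]) -> Optional[float]:
    """Events per opportunity pooled across meetings; None when there were no opportunities."""
    total = sum(opportunities)
    return sum(events) / total if total > 0 else None


def directional_concentration(a_to_b: Optional[float], b_to_a: Optional[float]) -> Optional[float]:
    """Share of the dyad's interruption rate that flows from A to B (0.5 means symmetric)."""
    if a_to_b is None or b_to_a is None or a_to_b + b_to_a == 0:
        return None
    return a_to_b / (a_to_b + b_to_a)
```

Pooling events and opportunities before dividing keeps a short meeting with few opportunities from counting as much as a long one.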

11.2 Chilling delta
Very fragile in sparse data.
Needs:
- good pre/post participation opportunity
- enough meeting participation
- careful baseline per person
Cold-start skip is correct; extend this logic aggressively.

11.3 Floor-time Gini
Good team-level climate signal.
Weak as person-level accusation.
Aggregate as:
- team climate index
- subgroup compression trend
- person share deviation from own baseline within the same meeting type
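
For reference, a per-meeting Gini coefficient over speaking time can be computed with the standard sorted-values formula. How Kashi's detector measures floor time is not specified here; the input is assumed to be one non-negative value per participant.

```python
def gini(floor_times: list[float]) -> float:
    """Gini coefficient of non-negative values: 0 is perfectly equal, (n - 1) / n is the maximum."""
    xs = sorted(floor_times)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending and i from 1.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

Because the maximum depends on the number of participants, per-meeting values are only comparable across meetings of similar size.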

11.4 Unanswered-question rate
Needs opportunity denominator.
Do not aggregate raw counts.
Use:
- questions asked
- response windows
- input-quality / semantic-confidence modifier if semantics are involved

11.5 Topic-credit ignored-turns
High-value but semantically fragile.
Do not let low-confidence topic similarity dominate longitudinal conclusions.
Needs its own confidence budget and should not be allowed to “outvote” cleaner structural detectors in sparse settings.

11.6 Agreement asymmetry
Potentially useful, but also semantically and contextually fragile.
Keep separate confidence and require stronger exposure before escalating.

12. Confidence object design

Kashi should stop treating “confidence” as one scalar.
Use a confidence object.

Suggested confidence object:
{
  transcript_confidence,
  diarization_confidence,
  meeting_type_confidence,
  detector_confidence,
  exposure_confidence,
  stability_confidence,
  anti_confound_confidence,
  overall_evidence_grade
}

Why:
A dyad interruption stream may have:
- high detector logic confidence
- low diarization confidence
That should downgrade the stream without pretending the whole system is equally certain or uncertain.
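
The suggested object maps naturally onto a small immutable record. The 0..1 scale for the numeric components and the `weakest_component` helper are assumptions added for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConfidenceObject:
    transcript_confidence: float
    diarization_confidence: float
    meeting_type_confidence: float
    detector_confidence: float
    exposure_confidence: float
    stability_confidence: float
    anti_confound_confidence: float
    overall_evidence_grade: str  # "A".."D" or "X"

    def weakest_component(self) -> tuple[str, float]:
        """Name and value of the lowest numeric component, useful as a downgrade reason."""
        numeric = {k: v for k, v in asdict(self).items() if isinstance(v, float)}
        name = min(numeric, key=numeric.get)
        return name, numeric[name]
```

In the dyad interruption example above, `weakest_component()` would return the diarization entry, which makes the reason for the downgrade explicit.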

13. Abstention policy

A serious system needs the power to say “not enough longitudinal basis.”

Abstain when:
- comparable exposure too sparse
- meeting types too mixed
- detector opportunity too low
- one meeting dominates the trend
- input quality too weak
- result disappears after normalization or leave-one-out
- meeting type unsupported or low-confidence
- privacy thresholds prevent safe aggregation

Abstention output should still show:
- what was observed
- why interpretation is limited
- what additional exposure would increase confidence

Example UI copy logic:
Observed: elevated interruption burden in 2 recent comparable meetings.
Not shown as a persistent pattern because comparable exposure is still limited and one meeting currently contributes too much of the evidence.

That is far better than a fake number.
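
A sketch of how the three parts of an abstention output could be assembled from reason codes. The codes, the wording table, and the `more_meetings_needed` estimate are hypothetical.

```python
# Hypothetical reason codes and wording; real codes are not specified in this memo.
REASON_TEXT = {
    "exposure_sparse": "comparable exposure is still limited",
    "single_meeting_dominates": "one meeting currently contributes too much of the evidence",
    "input_quality_low": "input quality is too weak",
}


def abstention_output(observed: str, reason_codes: list[str], more_meetings_needed: int) -> dict:
    """Build the user-facing abstention payload: what was seen, why it is limited, what would help."""
    why = " and ".join(REASON_TEXT.get(code, code) for code in reason_codes)
    return {
        "abstain_flag": True,
        "observed": observed,
        "why_limited": f"Not shown as a persistent pattern because {why}.",
        "what_would_help": f"About {more_meetings_needed} more comparable meetings.",
        "abstain_reason_codes": list(reason_codes),
    }
```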

14. Privacy / retaliation implications of trend windows

Trend windows are not neutral.
They can leak concern states and identity in small teams.

Important rule:
Trend-window inspection by the user must not create employer-visible telemetry.
Trend-window aggregates shown upward must obey anti-inference rules.

Operational implications:
- no employer-visible “user checked 30d vs 90d trend” events
- small-team suppression for subgroup and dyad views
- batching or delay for employer-side summaries
- no named subordinate trend browsing by managers
- no hidden mirror export into appraisal or discipline workflows

Longitudinal aggregation increases inferability because patterns become more identifiable over time. The privacy model must therefore be stricter, not looser, at the cross-meeting layer.
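
Small-team suppression for upward summaries can start as a simple minimum-group-size filter. The threshold of 5 is a placeholder, and the input shape (group key mapped to member count and aggregate value) is an assumption.

```python
from typing import Optional


def suppress_small_groups(aggregates: dict[str, tuple[int, float]],
                          min_group_size: int = 5) -> dict[str, Optional[float]]:
    """Replace the aggregate with None for any group smaller than the minimum size."""
    return {
        key: (value if n_members >= min_group_size else None)
        for key, (n_members, value) in aggregates.items()
    }
```

A size filter alone does not stop inference by differencing two overlapping aggregates, so it complements the batching and delay rules above rather than replacing them.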

15. Recommended data model additions

meeting_table
- meeting_id
- meeting_type
- meeting_type_confidence
- internal_external_flag
- language_regime
- transcript_confidence
- diarization_confidence
- comparable_group_key

detector_event_table
- detector_family
- actor_id
- target_id_nullable
- opportunity_count
- raw_value
- normalized_value
- detector_confidence
- excluded_from_longitudinal_reason_nullable

stream_aggregate_table
- stream_type (person/dyad/team/subgroup)
- stream_key
- detector_family
- comparable_group_key
- window_30d_value
- window_90d_value
- window_180d_value
- ewma_value
- cusum_value_nullable
- persistence_rate
- drift_delta
- exposure_score
- instability_score
- leave_one_out_fragility
- evidence_grade
- abstain_flag
- abstain_reasons_json
- generated_at

review_support_object_table
- object_id
- stream_key
- detector_family
- review_priority
- evidence_grade
- supporting_meeting_ids
- top_reason_codes
- bounded_context_refs

16. Recommended implementation sequence

P0
- meeting comparability key
- exposure fields
- abstention reasons
- per-meeting influence cap
- leave-one-out fragility check
- 30 / 90 / 180 window summaries

P1
- EWMA drift monitor
- confidence object
- evidence grades
- detector-specific opportunity denominators
- team vs dyad vs person stream separation

P2
- CUSUM small-shift detection where useful
- empirical-Bayes shrinkage / hierarchical modeling
- subgroup streams under privacy controls
- deeper multilingual / locale-conditioned priors

17. Recommended algorithm skeleton

For each meeting:
1. compute detector outputs
2. normalize by detector-specific opportunity
3. attach input-quality and meeting-type-confidence weights
4. decide whether each output is eligible for longitudinal inference

For each comparable stream:
5. gather eligible meeting outputs by detector family and stream key
6. apply per-meeting influence cap
7. compute robust window summaries (30/90/180)
8. compute EWMA and optional CUSUM
9. compute exposure score
10. run leave-one-out fragility test
11. assign evidence grade
12. abstain if rules triggered
13. only then build review-support object if priority and confidence both pass threshold
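
A compressed sketch of steps 5 to 12 for one stream and one detector family. It assumes steps 1 to 4 ran upstream, uses winsorizing as the influence cap and the median as the robust summary, and leaves out evidence grading (step 11) and object construction (step 13). All thresholds and names are placeholders.

```python
from dataclasses import dataclass
from statistics import fmean, median


@dataclass(frozen=True)
class MeetingOutput:
    days_ago: int            # age of the meeting relative to the aggregation date
    normalized_score: float  # steps 1-3 assumed done upstream
    eligible: bool           # step 4 decision


def aggregate_stream(outputs: list[MeetingOutput], *, clip: float = 3.0,
                     signal_threshold: float = 1.0, min_meetings: int = 5) -> dict:
    # Steps 5-6: keep eligible outputs (oldest first) and winsorize each score.
    eligible = sorted((o for o in outputs if o.eligible),
                      key=lambda o: o.days_ago, reverse=True)
    capped = [(o.days_ago, max(-clip, min(clip, o.normalized_score))) for o in eligible]

    # Step 7: robust (median) window summaries.
    def window(days: int):
        values = [v for d, v in capped if d < days]
        return median(values) if values else None

    # Step 8: EWMA over the stream, oldest to newest.
    ewma = None
    for _, v in capped:
        ewma = v if ewma is None else 0.3 * v + 0.7 * ewma

    # Steps 9-10: exposure and leave-one-out fragility on the 90-day window.
    last_90 = [v for d, v in capped if d < 90]
    has_signal = bool(last_90) and fmean(last_90) >= signal_threshold
    fragile = has_signal and (
        len(last_90) == 1
        or any(fmean(last_90[:i] + last_90[i + 1:]) < signal_threshold
               for i in range(len(last_90)))
    )

    # Step 12: collect abstention reasons.
    reasons = []
    if len(last_90) < min_meetings:
        reasons.append("exposure_sparse")
    if fragile:
        reasons.append("leave_one_out_fragile")

    return {
        "window_30d_value": window(30),
        "window_90d_value": window(90),
        "window_180d_value": window(180),
        "ewma_value": ewma,
        "exposure_meetings_90d": len(last_90),
        "leave_one_out_fragility": fragile,
        "abstain_flag": bool(reasons),
        "abstain_reasons": reasons,
    }
```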

18. Test plan / acceptance criteria

A. One weird meeting does not poison the trend
Given a 90-day stream with one extreme meeting and otherwise normal history,
the stream should:
- preserve the severe meeting as visible evidence
- not automatically produce a persistent-pattern object
- show degraded stability if that one meeting dominates

B. Comparable exposure required
Given many meetings with low interaction opportunity,
the system should not treat raw meeting count as strong evidence.

C. Cross-type pooling blocked
Given standups + 1:1s + training sessions mixed together,
the system should not produce one unified inferential score without meeting-type normalization.

D. Sparse-data conservatism
Given fewer than the required comparable opportunities,
the system should abstain or downgrade.

E. Drift detection works
Given small but persistent deterioration across recent comparable meetings,
EWMA/CUSUM should detect emergence earlier than simple average-threshold logic.

F. Leave-one-out fragility exposed
If removing one meeting destroys the signal,
fragility should be high and evidence grade low.

G. Privacy safe
User trend exploration creates no employer-visible event.
Small-team upward summaries are suppressed or anti-inference filtered.

19. Critical conclusion

The aggregation layer should be treated as a measurement engine, not a reporting layer.

If Kashi gets this right:
- the product becomes materially more defensible
- the system can explain why a pattern is considered persistent
- one ugly meeting does not fabricate a fake “trend”
- recent deterioration can be detected before the whole 90-day window turns red
- abstention becomes a strength instead of a bug

If Kashi gets this wrong:
- it will either overreact to noise
- or smooth real deterioration into irrelevance
- or both, depending on which stakeholder is looking.

Recommended doctrine in one sentence:
Kashi should aggregate review-support evidence across comparable meetings using robust, recency-aware, exposure-weighted, confidence-aware streams at person, dyad, and team levels, with strict abstention when one meeting or weak input would otherwise fake a pattern.

Selected references used in this memo

Internal Kashi docs
- Kashi — Progress & Project Overview (2026-04-21)
- Kashi Measurement-Science Research Memo (2026-04-21)
- Kashi Meeting-Type Normalization Research Memo (2026-04-21)
- meeting_governance_ai_concept_note.docx
- Kashi Retaliation-Risk Research Memo (2026-04-21)

External
- NIST AI 800-3: Expanding the AI Evaluation Toolbox with Statistical Models (official NIST summary page, 2026)
- NIST AI 800-4: Challenges to the Monitoring of Deployed AI Systems (official NIST summary page, 2026)
- NIST/SEMATECH e-Handbook of Statistical Methods: EWMA Control Charts
- NIST/SEMATECH e-Handbook of Statistical Methods: CUSUM Control Charts