Kashi: Confidence / Abstention Perspective
Technical research memo for developers
Prepared: 2026-04-21

Purpose
Turn the confidence / abstention critique into a concrete technical design for Kashi.
This is not product poetry. It is a developer-facing recommendation for how the system should represent uncertainty, when it must refuse interpretation, and how that should affect detector logic, aggregation, storage, UI, and rollout.

Primary conclusion
A serious Kashi cannot just have a single hidden “confidence” number inside a composite score.
It needs an explicit confidence architecture and an explicit abstention policy.
Otherwise the system will overstate what weak, confounded, or unsupported inputs can justify.

The correct posture is:
Kashi estimates repeated interaction asymmetry under uncertainty, within comparable meeting contexts, for review support.
Not harm. Not harassment. Not intent. Not truth.

--------------------------------------------------
1. Bottom-line technical judgment
--------------------------------------------------

Kashi should treat abstention as a first-class feature, not an embarrassing fallback.

The current internal research already points the same way:
- the measurement-science memo says Kashi becomes much more credible when framed as estimating repeated interaction asymmetry under uncertainty, and says the system should show evidence grade, reason codes, and abstain when evidence is insufficient or confounded.
- the meeting-type memo says unsupported or low-confidence meeting types should fall back to observational metrics only and should not generate review-worthy events by default.
- the current deck already has a composite score that includes “confidence,” but that confidence is still under-specified.
- the current deck also already shows one abstention-like pattern in practice: chilling-delta has a cold-start skip rule when there are fewer than 5 meetings.

That is directionally right, but still too weak.

Right now confidence is more like a hidden stabilizer than a transparent measurement layer.
That is not enough.

Kashi needs:
1) confidence objects, plural
2) hard suppression rules
3) watch-mode / observation-only states
4) explicit evidence grades
5) detector-specific reliability logic
6) meeting-type support logic
7) anti-inference / anti-retaliation suppression on the presentation layer
8) auditability for why the system abstained or down-ranked

--------------------------------------------------
2. Why this matters specifically for Kashi
--------------------------------------------------

This is not a generic ML-quality problem. It is a category-legitimacy problem.

Kashi’s entire defensibility story depends on saying:
- the system is structurally explainable
- the system is narrower than surveillance tools
- the system does not pretend to know more than it knows
- outputs are for human review, not automated judgment

If the system still emits person-level risk objects under weak transcript quality, weak diarization, sparse exposure, unsupported meeting types, or strong confounds, that whole story collapses.

Worse: a weakly-supported output inside an employer-facing governance product is not just “a model mistake.” It becomes pseudo-evidence. That is exactly the failure mode the measurement-science memo warns against.

For Kashi, false certainty is more dangerous than silence.

A weakly grounded score can:
- create overconfidence in reviewers
- create false reassurance when the system says nothing meaningful but still emits a low-risk-looking output
- create retaliation-sensitive inference if tiny-team outputs become visible
- make the product look like a truth machine it is not

So confidence / abstention is not a side UX feature. It is part of the legal, trust, and scientific posture.

--------------------------------------------------
3. Hard critique of the current state
--------------------------------------------------

3.1 Confidence exists rhetorically, not architecturally
The deck says Layer 5 uses a composite score = severity × repetition × directionality × confidence.
That sounds mature, but technically it is still too vague.

Missing questions:
- Confidence of what exactly?
- Confidence in input quality?
- Confidence in detector firing?
- Confidence in meeting-type interpretation?
- Confidence in baseline adequacy?
- Confidence in cross-meeting persistence?
- Confidence after user-marked confounds?

A single scalar cannot carry all that without becoming opaque.

3.2 The current system has at least one real abstention hint, but not a full abstention doctrine
The cold-start rule for chilling-delta (<5 meetings = skip) is good. But it is detector-local and narrow. Kashi needs a global abstention doctrine.

3.3 The current deck still has a content contradiction
The measurement-science memo is right: the deck says “metadata only / no content,” but some listed detectors imply semantic interpretation. That matters directly for confidence, because constrained semantic detectors need different reliability logic than pure structural detectors. If the detector taxonomy stays muddy, the confidence model will stay muddy too.

3.4 Meeting-type support is still not integrated deeply enough
The meeting-type memo is blunt: unsupported or low-confidence meeting types should not produce review-worthy events by default. That means meeting type is not decorative metadata. It is a confidence precondition for interpretation.

3.5 Current evaluation is mechanism proof, not confidence proof
3 harmful seeds + 1 control is fine for a hackathon. It is not enough to justify a mature confidence model.

The current eval mostly shows:
- the planted pattern fires
- one healthy control stays clean
- reruns are deterministic

It does not yet show:
- what happens under degraded transcript quality
- what happens under diarization error
- what happens in multilingual or L2-heavy settings
- what happens in unsupported meeting types
- what happens when different detectors disagree
- whether human reviewers agree on “review-worthiness” under uncertainty

So the correct line to developers is simple:
Confidence is not a cosmetic extra on top of a finished detector stack.
Confidence is what determines whether the detector stack is allowed to speak.

--------------------------------------------------
4. Recommended confidence architecture
--------------------------------------------------

Do not build one “confidence_score”. Build a confidence bundle.

Recommended top-level structure:

confidence_bundle = {
  input_quality,
  context_support,
  exposure_support,
  detector_support,
  aggregation_support,
  presentation_support,
  final_evidence_grade,
  abstention_state,
  reason_codes
}

4.1 Input-quality confidence
This answers: “Can we trust the substrate enough to even interpret the event?”

Required fields:
- transcript_confidence
  - ASR confidence summary
  - word/timestamp coverage
  - missing-span rate
  - overlap-heavy segment rate
  - language-ID stability
- diarization_confidence
  - speaker attribution confidence
  - speaker-switch stability
  - unknown-speaker share
  - name-collision / guest-speaker ambiguity flags
- segmentation_confidence
  - turn-boundary confidence
  - mid-word truncation detectability
  - overlap resolution quality
- audio_condition_flags
  - crosstalk-heavy
  - poor audio quality
  - clipping
  - partial transcript
  - post-hoc imported transcript vs native platform transcript

Why this matters:
Directional interruption, chilling, and dyadic concentration are all fragile if speaker attribution is wrong. If diarization is weak, Kashi should suppress directional interpretation before it does anything else.

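A minimal sketch of how the input-quality fields above could feed a hard gate for directional detectors. The type names, the numeric thresholds, and every reason code other than LOW_DIARIZATION_CONFIDENCE are illustrative assumptions; the memo requires thresholds but does not specify values.

```typescript
// Sketch of an input-quality gate. All thresholds are placeholder assumptions;
// the memo calls for thresholds but does not give numeric values.
interface InputQualitySample {
  transcriptConfidence: number;   // 0..1
  diarizationConfidence: number;  // 0..1
  segmentationConfidence: number; // 0..1
  unknownSpeakerShare: number;    // 0..1, share of speech with no attributed speaker
}

interface InputGateThresholds {
  minTranscript: number;
  minDiarization: number;
  minSegmentation: number;
  maxUnknownSpeakerShare: number;
}

interface InputGateResult {
  directionalInterpretationAllowed: boolean;
  reasonCodes: string[];
}

// Hypothetical defaults, to be replaced by calibrated values.
const ASSUMED_INPUT_THRESHOLDS: InputGateThresholds = {
  minTranscript: 0.8,
  minDiarization: 0.85,
  minSegmentation: 0.75,
  maxUnknownSpeakerShare: 0.1,
};

function inputQualityGate(
  q: InputQualitySample,
  t: InputGateThresholds = ASSUMED_INPUT_THRESHOLDS,
): InputGateResult {
  const reasonCodes: string[] = [];
  // Every failed check is recorded so the abstention stays auditable.
  if (q.transcriptConfidence < t.minTranscript) {
    reasonCodes.push("LOW_TRANSCRIPT_CONFIDENCE");
  }
  if (q.diarizationConfidence < t.minDiarization) {
    reasonCodes.push("LOW_DIARIZATION_CONFIDENCE");
  }
  if (q.segmentationConfidence < t.minSegmentation) {
    reasonCodes.push("LOW_SEGMENTATION_CONFIDENCE");
  }
  if (q.unknownSpeakerShare > t.maxUnknownSpeakerShare) {
    reasonCodes.push("HIGH_UNKNOWN_SPEAKER_SHARE");
  }
  // Directional interpretation is allowed only when no check failed.
  return {
    directionalInterpretationAllowed: reasonCodes.length === 0,
    reasonCodes,
  };
}
```

Returning the full list of reason codes, rather than stopping at the first failure, keeps the gate compatible with the audit-logging requirement later in the memo.
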
4.2 Context-support confidence
This answers: “Even if the substrate is readable, is this a context we know how to interpret?”

Required fields:
- meeting_type
- meeting_type_confidence
- meeting_type_supported (boolean)
- internal_vs_external
- role_schema_completeness
- role_schema_confidence
- language_regime
  - single-language / multilingual / code-switching-heavy / L2-heavy
- language_support_status
  - supported / caution / unsupported
- confound_flags
  - facilitator
  - chair
  - incident commander
  - trainer
  - presenter
  - new joiner
  - L2 speaker
  - self-declared low-speaking preference
- confound_burden_score

Why this matters:
The same interruption rate means different things in a standup, incident bridge, critique, brainstorm, 1:1, training, or executive review. If meeting type is unknown, unsupported, or low-confidence, Kashi may compute metrics but must not construct review-worthy events by default.

4.3 Exposure / baseline confidence
This answers: “Do we have enough comparable evidence to interpret this at person or dyad level?”

Required fields:
- comparable_meeting_count
- comparable_minutes_observed
- interaction_opportunity_count
- dyadic_exchange_count
- baseline_window_days
- baseline_relevance_score
  - role-matched baseline availability
  - meeting-type-matched baseline availability
- exposure_sufficiency_state
  - insufficient / thin / adequate / strong

Why this matters:
Meeting count alone is fake precision. Ten meetings with no real speaking opportunity may be less informative than three highly interactive comparable meetings.

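One way to avoid the fake precision of a bare meeting count is to grade exposure by its weakest dimension. The sketch below is an assumption-laden illustration: the cutoff values, the `ExposureSample` type, and the weakest-dimension rule itself are hypothetical choices, not something the memo prescribes.

```typescript
type ExposureSufficiency = "insufficient" | "thin" | "adequate" | "strong";

interface ExposureSample {
  comparableMeetingCount: number;
  comparableMinutesObserved: number;
  interactionOpportunityCount: number;
}

// Each triple is [thin, adequate, strong]. Placeholder values only.
const ASSUMED_EXPOSURE_CUTOFFS = {
  meetings: [3, 5, 10],
  minutes: [30, 120, 300],
  opportunities: [5, 20, 50],
};

// Number of cutoffs the value meets or exceeds (0..3).
function exposureTier(value: number, cutoffs: number[]): number {
  return cutoffs.filter((c) => value >= c).length;
}

function classifyExposure(e: ExposureSample): ExposureSufficiency {
  // The weakest dimension decides: many meetings cannot compensate
  // for a lack of real interaction opportunity.
  const weakest = Math.min(
    exposureTier(e.comparableMeetingCount, ASSUMED_EXPOSURE_CUTOFFS.meetings),
    exposureTier(e.comparableMinutesObserved, ASSUMED_EXPOSURE_CUTOFFS.minutes),
    exposureTier(e.interactionOpportunityCount, ASSUMED_EXPOSURE_CUTOFFS.opportunities),
  );
  if (weakest >= 3) return "strong";
  if (weakest === 2) return "adequate";
  if (weakest === 1) return "thin";
  return "insufficient";
}

// Ten meetings with almost no speaking opportunity.
const manyQuietMeetings = classifyExposure({
  comparableMeetingCount: 10,
  comparableMinutesObserved: 300,
  interactionOpportunityCount: 2,
});

// Three highly interactive comparable meetings.
const fewInteractiveMeetings = classifyExposure({
  comparableMeetingCount: 3,
  comparableMinutesObserved: 180,
  interactionOpportunityCount: 40,
});

console.log(manyQuietMeetings, fewInteractiveMeetings);
```

Under these placeholder cutoffs the ten quiet meetings grade as "insufficient" while the three interactive meetings grade as "thin", which matches the ordering the paragraph above argues for.
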
4.4 Detector-support confidence
This answers: “How reliable is this specific detector firing in this specific case?”

Each detector should emit:
- detector_fired: boolean
- detector_strength: numeric
- detector_reliability: numeric or tiered
- detector_reason_codes: list
- detector_suppressed: boolean
- detector_suppression_reason: enum

Recommended detector-specific logic:

A) Intrusive interruption
High dependency on diarization and segmentation quality.
Suppress when:
- speaker attribution weak
- overlap segmentation unstable
- meeting type unsupported and role-heavy interruption is expected

B) Chilling delta
High dependency on baseline adequacy and comparable exposure.
Suppress when:
- insufficient pre-event baseline
- sparse prior participation
- participant role structurally low-speaking
- fewer than baseline minimum comparable meetings

C) Floor-time Gini
Lower dependence on diarization than directional detectors, but still context-sensitive.
Suppress person-level interpretation when:
- meeting type is trainer-led / executive review / incident bridge
- role schema incomplete
- one speaker is expected presenter / trainer / IC

D) Unanswered-question rate
This is not pure structure if “substantive response” is used.
Needs explicit taxonomy:
- purely structural variant: no direct response within N turns
- constrained semantic variant: response exists but semantically nonresponsive
Each variant needs separate reliability logic.

E) Topic-credit ignored-turns
This is explicitly similarity-based. It is not honestly “metadata only.”
Needs:
- semantic model version
- embedding similarity reliability
- language-support gating
- multilingual caution
- separate semantic-detector confidence channel

F) Agreement asymmetry
Also requires interpretive caution.
Needs:
- position-shift detection reliability
- role- and meeting-type caution
- language-support gating

4.5 Aggregation-support confidence
This answers: “Given all detector outputs, do we have enough stable cross-meeting evidence to escalate beyond observation?”

Required fields:
- detector_agreement_score
- cross-meeting_persistence_score
- temporal_stability_score
- directionality_stability_score
- confound_adjusted_persistence
- contradiction_flags
  - e.g. strong Gini but weak dyadic evidence
  - strong semantic detector but weak transcript quality
- overall_interpretability_state

Why this matters:
The system should not escalate on one loud detector if the rest of the evidence is unstable or contradictory.

4.6 Presentation-support confidence
This answers: “Even if the backend believes the signal is real enough, is it safe and appropriate to show it in this channel?”

Required fields:
- audience_role
- team_size
- anti_inference_risk
- privacy_suppression_required
- retaliation_sensitivity_flag
- watch_only_required
- sharing_restrictions

Why this matters:
A signal can be technically real enough for a private self-view but still not safe enough for an employer-side or aggregate surface. Confidence is not only epistemic. It is also procedural and audience-relative.

--------------------------------------------------
5. Recommended evidence-grade model
--------------------------------------------------

Use a visible evidence ladder. Do not hide everything inside a raw score.

Recommended grade set:

0. Blocked
Meaning: The system will not compute or will not display interpretive output.
Example reasons:
- transcript too weak
- diarization too weak
- unsupported meeting type
- protected route / anti-inference suppression

1. Insufficient evidence
Meaning: Metrics may exist in backend logs or private low-level view, but no interpretive conclusion should be shown.
Typical causes:
- thin comparable exposure
- sparse interaction opportunity
- strong confounds
- weak cross-meeting persistence

2. Weak pattern
Meaning: Some signal exists, but interpretation is unstable.
Allowed outputs:
- watch-mode
- observational note
- no review-worthy event

3. Emerging pattern
Meaning: Repeated signal across comparable meetings, but still not strong enough for aggressive interpretation.
Allowed outputs:
- private pattern narrative
- manager self-mirror caution
- aggregate “watch” state
- no strong institutional escalation by default

4. Stable pattern
Meaning: Repeated, comparable, low-confound, multi-detector-supported pattern with adequate input quality.
Allowed outputs:
- review-worthy event
- bounded event object
- user-shareable evidence package
- governed institutional review trigger if policy allows

Optional 5. High-confidence stable pattern
Use sparingly. Only for strong structural cases where the system has excellent substrate quality and repeated support.
This should not become common. If everything is “high confidence,” the model is lying.

--------------------------------------------------
6. Recommended abstention states
--------------------------------------------------

Do not use just “score vs no score.” Use explicit abstention states.

Suggested enum:
- NO_COMPUTE
- COMPUTE_NO_INTERPRETATION
- WATCH_ONLY
- INTERPRETABLE_PRIVATE_ONLY
- INTERPRETABLE_ROLE_BOUNDED

Meaning:

NO_COMPUTE
The substrate is too weak or the feature is out of scope.
Example:
- diarization totally broken
- transcript missing critical spans

COMPUTE_NO_INTERPRETATION
Raw metrics may be stored, but the system does not generate a narrative or event.
Example:
- meeting type unsupported
- language regime unsupported

WATCH_ONLY
Show observational telemetry only. No review-worthy event.
Example:
- slight directional asymmetry but thin exposure
- single borderline meeting

INTERPRETABLE_PRIVATE_ONLY
Show the pattern only to the affected individual or self-mirror owner. Not to employer-side audiences.
Example:
- emerging pattern with retaliation or inference sensitivity
- evidence strong enough for recognition but not for institutional surfacing

INTERPRETABLE_ROLE_BOUNDED
Allowed to enter bounded governance workflow.
Example:
- stable repeated pattern with adequate evidence and policy support

--------------------------------------------------
7. Minimum suppression rules Kashi should implement
--------------------------------------------------

These should be hard-coded rules, not optional analyst preferences.

7.1 Input-quality suppression
Suppress person-level interpretation when:
- transcript_confidence below threshold
- diarization_confidence below threshold
- unknown-speaker share above threshold
- overlap-heavy audio above threshold for interruption-based detectors
- segmentation confidence too low for truncation logic

7.2 Meeting-type suppression
Suppress review-worthy events when:
- meeting_type unknown and not inferable confidently
- meeting_type unsupported
- meeting_type low-confidence and model family not calibrated

Observation-only is okay here. Interpretation is not.

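The meeting-type rules in 7.2 can be sketched as a small gate that returns a fallback abstention state plus reason codes. This is a hedged illustration: `MeetingContext`, `modelFamilyCalibrated`, the 0.7 confidence floor, and the choice of WATCH_ONLY for the low-confidence case are assumptions, and only UNSUPPORTED_MEETING_TYPE comes from the memo's reason-code list.

```typescript
type AbstentionState =
  | "NO_COMPUTE"
  | "COMPUTE_NO_INTERPRETATION"
  | "WATCH_ONLY"
  | "INTERPRETABLE_PRIVATE_ONLY"
  | "INTERPRETABLE_ROLE_BOUNDED";

interface MeetingContext {
  meetingType: string | null;      // null: neither declared nor confidently inferred
  meetingTypeConfidence: number;   // 0..1
  meetingTypeSupported: boolean;
  modelFamilyCalibrated: boolean;  // hypothetical flag for "model family calibrated"
}

interface MeetingTypeGateResult {
  reviewWorthyAllowed: boolean;
  fallbackState: AbstentionState | null; // null when the gate passes
  reasonCodes: string[];
}

// Placeholder floor; the memo does not define a numeric value.
const ASSUMED_MIN_MEETING_TYPE_CONFIDENCE = 0.7;

function meetingTypeGate(ctx: MeetingContext): MeetingTypeGateResult {
  if (ctx.meetingType === null) {
    return {
      reviewWorthyAllowed: false,
      fallbackState: "COMPUTE_NO_INTERPRETATION",
      reasonCodes: ["UNKNOWN_MEETING_TYPE"],
    };
  }
  if (!ctx.meetingTypeSupported) {
    return {
      reviewWorthyAllowed: false,
      fallbackState: "COMPUTE_NO_INTERPRETATION",
      reasonCodes: ["UNSUPPORTED_MEETING_TYPE"],
    };
  }
  const lowConfidence =
    ctx.meetingTypeConfidence < ASSUMED_MIN_MEETING_TYPE_CONFIDENCE;
  if (lowConfidence && !ctx.modelFamilyCalibrated) {
    // Observation-only is acceptable here; interpretation is not.
    return {
      reviewWorthyAllowed: false,
      fallbackState: "WATCH_ONLY",
      reasonCodes: ["LOW_MEETING_TYPE_CONFIDENCE"],
    };
  }
  return { reviewWorthyAllowed: true, fallbackState: null, reasonCodes: [] };
}
```

Passing this gate only means meeting type does not block interpretation; the input-quality and exposure gates still apply independently.
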
7.3 Exposure suppression
Suppress person-level or dyad-level interpretation when:
- comparable exposure too thin
- interaction opportunity too low
- one-off meeting dominates evidence
- baseline window insufficient

7.4 Role/confound suppression
Down-rank or suppress when:
- facilitator / chair role explains asymmetry
- incident commander / trainer / presenter role explains floor control
- self-marked confounds remain unresolved
- L2 / multilingual caution state active without calibrated support

7.5 Anti-inference suppression
Suppress employer-side display when:
- team size too small
- one person too obviously reconstructable
- timing/context makes identity inferable
- review object would indirectly disclose private concern formation

7.6 Protected-route suppression
Never create employer-visible signals from:
- pattern page open
- repeated self-review
- confound marking
- vault creation
- draft preparation
- support-link usage

These are not review triggers. These are private awareness states.

--------------------------------------------------
8. Recommended UI behavior when the system abstains
--------------------------------------------------

The system should not look broken or cowardly when it abstains. It should look disciplined.

Recommended user-facing states:

8.1 For self-view
“Not enough comparable evidence yet.”
“Transcript or speaker-attribution quality was too weak for confident interpretation.”
“This meeting format is not yet calibrated for risk interpretation. Raw observations may still be shown below.”
“You marked context that may explain part of this pattern. We are down-weighting interpretation until more evidence accumulates.”

8.2 For manager self-mirror
“Observations available, but no stable interpretation yet.”
“Current data is too thin or context-sensitive for reliable behavioral inference.”

8.3 For employer-side aggregate surfaces
Do not show a fake low-risk state.
Show one of:
- suppressed
- insufficient evidence
- unsupported context
- watch only

Important:
Absence of interpreted output must never silently imply absence of issue. That would turn abstention into exoneration.

--------------------------------------------------
9. Proposed backend data model
--------------------------------------------------

Suggested TypeScript-style interfaces:

interface ConfidenceBundle {
  inputQuality: InputQualityConfidence
  contextSupport: ContextSupportConfidence
  exposureSupport: ExposureSupportConfidence
  detectorSupport: DetectorConfidence[]
  aggregationSupport: AggregationSupportConfidence
  presentationSupport: PresentationSupportConfidence
  finalEvidenceGrade: EvidenceGrade
  abstentionState: AbstentionState
  reasonCodes: ReasonCode[]
}

interface InputQualityConfidence {
  transcriptConfidence: number
  diarizationConfidence: number
  segmentationConfidence: number
  overlapQuality: number
  unknownSpeakerShare: number
  languageSupport: 'supported' | 'caution' | 'unsupported'
  flags: string[]
}

interface ContextSupportConfidence {
  meetingType: string | null
  meetingTypeConfidence: number
  meetingTypeSupported: boolean
  roleSchemaCompleteness: number
  roleSchemaConfidence: number
  confoundFlags: string[]
  confoundBurdenScore: number
}

interface ExposureSupportConfidence {
  comparableMeetingCount: number
  comparableMinutesObserved: number
  interactionOpportunityCount: number
  dyadicExchangeCount: number
  baselineWindowDays: number
  exposureSufficiency: 'insufficient' | 'thin' | 'adequate' | 'strong'
}

interface DetectorConfidence {
  detectorName: string
  fired: boolean
  strength: number
  reliability: number
  suppressed: boolean
  suppressionReasons: string[]
  reasonCodes: string[]
}

interface AggregationSupportConfidence {
  detectorAgreementScore: number
  persistenceScore: number
  temporalStabilityScore: number
  directionalityStabilityScore: number
  contradictionFlags: string[]
}

interface PresentationSupportConfidence {
  audienceRole: string
  teamSize: number
  antiInferenceRisk: number
  retaliationSensitivity: boolean
  privacySuppressionRequired: boolean
}

Suggested enums:
- EvidenceGrade = BLOCKED | INSUFFICIENT | WEAK | EMERGING | STABLE | HIGH_CONFIDENCE_STABLE
- AbstentionState = NO_COMPUTE | COMPUTE_NO_INTERPRETATION | WATCH_ONLY | INTERPRETABLE_PRIVATE_ONLY | INTERPRETABLE_ROLE_BOUNDED

--------------------------------------------------
10. Pipeline design recommendation
--------------------------------------------------

Recommended execution order:

Stage 0: Ingest
Pull transcript, timestamps, diarization, metadata.

Stage 1: Input quality gate
Compute transcript/diarization/segmentation/language quality.
If below hard threshold, stop interpretation.

Stage 2: Context gate
Infer or require meeting type, role schema, internal vs external, language regime.
If unsupported or low-confidence, move to compute-no-interpretation or watch-only.

Stage 3: Baseline gate
Check comparable exposure and baseline adequacy.
If too thin, allow telemetry but block person-level interpretation.

Stage 4: Detector execution
Run only detectors allowed under current quality/context state.
Each detector emits its own confidence object.

Stage 5: Aggregation
Combine only non-suppressed detector outputs.
Use agreement and persistence logic.
Do not let one strong detector overwhelm contradictory low-confidence conditions.

Stage 6: Evidence grading
Map confidence bundle to final evidence grade and abstention state.

Stage 7: Presentation policy
Apply audience-specific suppression, anti-inference filtering, and protected-route rules.

Stage 8: Audit logging
Store:
- what was suppressed
- why
- which thresholds applied
- what was shown to which audience

--------------------------------------------------
11. Recommended detector taxonomy fix
--------------------------------------------------

Kashi should stop pretending all listed detectors live in one epistemic bucket.

Use an explicit tier system.

Tier 1: Structural detectors
Derivable from timing, speaker identity, overlap, adjacency, turn counts, participation counts.
Examples:
- intrusive interruption
- floor-time Gini
- response latency (structural variant)
- dyadic concentration
- chilling delta (if baseline support exists)

Tier 2: Constrained transcript-semantic detectors
Use transcript text or embedding/similarity logic but remain bounded and explainable.
Examples:
- unanswered-question substantive-response logic
- topic-credit ignored-turns
- agreement asymmetry if position shift is transcript-derived

Tier 3: Forbidden outputs
Do not build or claim.
Examples:
- emotion inference
- intent inference
- harassment classification
- future-behavior prediction
- legal conclusions

Why this matters for confidence:
Tier 2 detectors need their own language support, semantic-model versioning, and failure modes. They cannot inherit Tier 1’s cleaner “metadata-only” confidence story.

--------------------------------------------------
12. Confidence objects Kashi should explicitly expose
--------------------------------------------------

If the user asks “why did this fire?” or “why didn’t this fire?”, the system should be able to answer with structured reasons.

Minimum exposed confidence objects:
- transcript confidence
- diarization confidence
- comparable exposure sufficiency
- meeting-type support
- role-schema support
- detector agreement
- confound burden
- evidence grade
- abstention state
- reason codes

Reason code examples:
- LOW_DIARIZATION_CONFIDENCE
- UNSUPPORTED_MEETING_TYPE
- THIN_COMPARABLE_EXPOSURE
- HIGH_CONFOUND_BURDEN
- STRUCTURAL_ONLY_SUPPORT
- SEMANTIC_DETECTOR_LANGUAGE_CAUTION
- SMALL_TEAM_PRIVACY_SUPPRESSION
- PROTECTED_ROUTE_NO_EMPLOYER_VISIBILITY

--------------------------------------------------
13. How this should affect review-worthy event construction
--------------------------------------------------

Current logic:
severity × repetition × directionality × confidence

Recommended change:
Do not let event construction happen unless minimum gate conditions pass.

Suggested logic:

if abstention_state in {NO_COMPUTE, COMPUTE_NO_INTERPRETATION}:
    do not construct review-worthy event

if abstention_state == WATCH_ONLY:
    construct observation bundle only, no review-worthy event

if final_evidence_grade in {EMERGING, STABLE, HIGH_CONFIDENCE_STABLE}:
    event eligibility allowed

Then apply audience policy.

Important:
A review-worthy event should not exist merely because one composite number crossed threshold. It should require:
- adequate input quality
- supported context
- sufficient comparable exposure
- detector support not dominated by suppressed detectors
- acceptable anti-inference / presentation state for that audience

--------------------------------------------------
14. Contestability requirements tied to confidence
--------------------------------------------------

Confidence is not complete unless users can challenge it.

Users should be able to:
- dispute transcript accuracy
- dispute speaker attribution
- mark role/context confounds
- mark meeting structure confounds
- request review of interpretation
- see if the system suppressed or down-ranked due to their confounds
- see access history for protected drill-downs

This matters because confidence is partly a product of known-but-not-yet-modeled context. If the user cannot inject context, the system will pretend to be more certain than it is.

--------------------------------------------------
15. Security / logging / privacy implications
--------------------------------------------------

Do not let protected telemetry leak into employer-facing analytics.

Protected events:
- pattern page open
- vault create
- vault activity
- repeated self-review
- draft creation
- support resource usage

If technical logs are needed for security/reliability, separate them from business analytics. They must not be queryable by managers, HR, or governance reviewers.

Also log abstention reasons.
Why:
- helps debug false certainty
- helps explain why a case did not escalate
- avoids silent “no issue” interpretations
- gives reviewers an audit trail of measurement restraint

--------------------------------------------------
16. Validation roadmap specifically for confidence / abstention
--------------------------------------------------

P0
Build the confidence bundle schema and abstention enums.
No more hidden single confidence scalar.

P0
Add hard gates for transcript quality, diarization quality, meeting-type support, and comparable exposure.

P0
Split detector taxonomy into structural vs constrained semantic.
Stop muddying the measurement story.

P0
Implement watch-only and compute-no-interpretation states.

P1
Add visible evidence grades and reason codes in private/self-facing surfaces.

P1
Build meeting-type suppression / reweight matrices.

P1
Add anti-inference suppression in employer-side surfaces.

P1
Add protected-route telemetry partitioning.

P1
Add user contestability workflow for transcript, diarization, context, and thresholds.

P2
Run confidence-specific QA suites:
- degraded ASR
- wrong diarization
- multilingual / code-switching
- L2-heavy meetings
- tiny-team inference risk
- unsupported meeting types
- role-heavy meetings (trainer, presenter, incident commander)
- detector disagreement cases

P2
Run human-review alignment studies on event bundles that include evidence grade and abstention states.
Not “is this abuse?”
But “is this worth review?”, “is this too weak to interpret?”, “did the abstention make sense?”

--------------------------------------------------
17. Exact project decisions the dev team should take now
--------------------------------------------------

Decision 1
Make abstention global, not detector-local.

Decision 2
Do not allow unsupported or low-confidence meeting types to generate review-worthy events.
Observation-only fallback instead.

Decision 3
Stop treating confidence as a hidden scalar in a composite formula.
Model it as a bundle with reason codes.

Decision 4
Explicitly separate structural detectors from constrained semantic detectors.
Their confidence models are not the same.

Decision 5
Treat transcript quality and diarization quality as first-order gates, not side metadata.

Decision 6
Do not let absence of interpreted output imply absence of issue.
UI must say insufficient evidence / unsupported context / watch only.

Decision 7
Make protected private states non-visible to employer-side surfaces, including metadata.

Decision 8
Force audience-relative presentation policy.
Some outputs can be self-view-only before they are institution-view-eligible.

Decision 9
Add type-matched evaluation and confound testing before claiming robustness.

Decision 10
Rewrite the internal dev spec so the system is described as estimating repeated interaction asymmetry under uncertainty, not detecting harm.

--------------------------------------------------
18. Recommended drop-in wording for the dev spec
--------------------------------------------------

Confidence and abstention
Kashi does not interpret every computed metric as a scoreable governance signal. Interpretation depends on input quality, meeting-type support, role/context support, comparable exposure, detector-specific reliability, and audience-specific safety constraints. Where evidence is weak, confounded, unsupported, or privacy-sensitive, Kashi will suppress, down-rank, or abstain rather than overstate certainty.

Evidence model
Kashi outputs are evidence-graded review-support signals, not findings.
Every review-worthy event must be backed by a confidence bundle that records substrate quality, contextual support, detector reliability, aggregation stability, abstention state, and reason codes.

Abstention rule
If transcript quality, speaker attribution, comparable exposure, or meeting-type support is below threshold, Kashi may compute raw observations but must not generate a review-worthy event by default. Unsupported or low-confidence meeting types fall back to observation-only mode.

Protected private states
Opening a private pattern page, marking confounds, drafting a report, enabling an evidence vault, or reviewing support content must not create employer-visible signals. Security logging for these routes must be technically segregated from employer-facing business analytics.

--------------------------------------------------
19. Final blunt conclusion
--------------------------------------------------

If Kashi cannot say “I don’t know” in a technically explicit way, it is not ready.

The core issue is not elegance. It is whether the system knows the difference between:
- a strong repeated structural pattern
- a weak suggestive pattern
- a context-confounded pattern
- a low-quality-input artifact
- an unsupported meeting regime
- a real signal that is still too privacy-sensitive to surface institutionally

That distinction is the real product. Not just the detectors.
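
One way to make “I don’t know” technically explicit is to encode the gate logic from section 13 as a single decision function. The sketch below is illustrative only: the `Audience` and `OutputDecision` types, the function name, and the handling of an interpretable state paired with a sub-EMERGING grade are assumptions rather than part of the memo.

```typescript
type EvidenceGrade =
  | "BLOCKED"
  | "INSUFFICIENT"
  | "WEAK"
  | "EMERGING"
  | "STABLE"
  | "HIGH_CONFIDENCE_STABLE";

type AbstentionState =
  | "NO_COMPUTE"
  | "COMPUTE_NO_INTERPRETATION"
  | "WATCH_ONLY"
  | "INTERPRETABLE_PRIVATE_ONLY"
  | "INTERPRETABLE_ROLE_BOUNDED";

// Hypothetical audience and outcome labels for this sketch.
type Audience = "self" | "employer";
type OutputDecision =
  | "no_event"
  | "observation_only"
  | "suppressed_for_audience"
  | "review_worthy_event";

function decideOutput(
  state: AbstentionState,
  grade: EvidenceGrade,
  audience: Audience,
): OutputDecision {
  // Gate 1: hard abstention states never construct an event.
  if (state === "NO_COMPUTE" || state === "COMPUTE_NO_INTERPRETATION") {
    return "no_event";
  }
  // Gate 2: watch-only yields an observation bundle, never an event.
  if (state === "WATCH_ONLY") {
    return "observation_only";
  }
  // Gate 3: only EMERGING and above are eligible for event construction.
  const eligible =
    grade === "EMERGING" ||
    grade === "STABLE" ||
    grade === "HIGH_CONFIDENCE_STABLE";
  if (!eligible) {
    // Assumption: an inconsistent state/grade pair degrades to observation.
    return "observation_only";
  }
  // Gate 4: audience policy is applied last.
  if (state === "INTERPRETABLE_PRIVATE_ONLY" && audience !== "self") {
    return "suppressed_for_audience";
  }
  return "review_worthy_event";
}
```

Returning "suppressed_for_audience" as a distinct outcome, instead of reusing "no_event", lets an employer-side surface show an explicit suppressed state rather than something that reads as low risk.
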